Preface

What Does Reliability Mean? 

Systems

The word “reliability” applies to systems that consist of people, machines, and written 
information. 

A system is reliable — that is, has good reliability — if the people who need it can depend 
on it over a reasonable period of time. People can depend on a system if it reasonably satisfies 
their needs. 

People

The views of the people involved in a system are different and depend on their responsi- 
bilities; some rely on it, others keep it reliable, and others do both. Consider an automatic 
grocery checkout system and the people involved: 

• The owners, who are the buyers 

• The store manager, who is responsible for its operation 

• The clerk, who operates it 

• The repair person, who maintains it in working condition 

• The customer, who buys the products 

Machines

A grocery checkout system may comprise several types of machines. It has mechanical 
(conveyor belt), electrical (conveyor belt motor, wiring), electronic (grocery and credit card 
scanners, display screen, and cash register), and structural (checkout counter, bag holder) 
parts. 

Written Information

Several types of written information contribute to the way people rely on a system: 

• The sales literature 

• The specifications 

• The detailed manufacturing drawings 

• The software user’s manual, programs, and procedures 

• The operating instructions 

• The parts and repair manual 

• The inventory control 

Reliability


People rely on systems to 

• Do work or provide entertainment 

• Do no unintentional harm to users, bystanders, property, or the environment 

• Be reasonably economical to own and to repair 

• Be safe to store or dispose of 

• Accomplish their purposes without failure 

What Does Reliability Engineering Mean? 

Reliability engineering means accomplishing specific tasks while a system is being 
planned, designed and developed, manufactured, used, and improved. These tasks are not the 
usual engineering and management tasks but are those that ensure that the system meets the
users’ expectations — not only when it is new but as it ages and requires repeated repairs. 

Why Do We Need Reliability Engineering? 

Technology users have always needed reliability engineering, but it has only developed 
since the 1940’s as a separate discipline. Before the Industrial Revolution, most of the 
reliability details were the individual worker’s responsibility because the machines, prod- 
ucts, and tools were relatively simple. However, shoddy goods were produced — wheels that 
broke easily, farming implements that were not dependable, lumber that rotted prematurely.

As technology rapidly changed, systems became large and complex. Companies that 
produce these systems must likewise be large and complex. In such situations, many 
important details that affect reliability are often relegated to a lower priority than completing 
a project on time and at an affordable cost. Among the first to see the need for a separate 
reliability discipline were the telephone and electric power utilities and the military. 

Chapter 1

Historical Perspective of Space System Reliability 


Summary 

The NASA Strategic Plan (ref. 1-1) is the backbone of our 
new Strategic Management System, an important aspect of 
which is risk management. Coincident with a decreasing NASA 
budget is the new working environment that demands a better, 
faster, and cheaper way to conduct business. In such an environ- 
ment where risk is considered a knowledge-based resource, 
mission assurance has come to play an important role in our 
understanding of risk. 

Through the years, much of mission assurance has been 
aimed at increasing independent systems engineering and further
refining basic design approaches. Now the time has come
to direct our attention to managing the risks that come from 
system interactions during a mission. To understand such risks, 
we must bring to bear all the engineering techniques at our 
disposal. Mission assurance engineers are entering the era of 
interaction in which engineering and system engineering must 
work closely to achieve better performance on time and within 
cost. 

A structured risk management approach is critical to a 
successful project. This is nothing new. A risk policy must be 
integral to the program as part of a concurrent engineering 
process, and risk and risk drivers must be monitored through- 
out. Risk may also be managed as a resource: the new way of 
managing better, faster, cheaper programs encompasses 
up-front, knowledge-based risk assessment. The safety and 
mission assurance (S&MA) community can provide valuable 
support as risk management consultants. 


Past Space System Reliability 

Ever since the need for improved reliability in space systems 
was recognized, it has been difficult to establish an identity for 
mission assurance engineering. Attempts to delineate an independent
set of tasks for mission assurance engineering in the
1970’s and 1980’s resulted in the development of applied
statistics for mission assurance and a large group of tasks for the
project. Mission failures in a well-developed system come 
from necessary risks that remain in the system for the mission. 
Risk management is the key to mission assurance. The tradi- 
tional tasks of applied statistics, reliability, maintainability, 
system safety, quality assurance, logistics support, human 
factors, software assurance, and system effectiveness for a 
project are still important and should still be performed. 

In the past, mission assurance activities were weakly struc- 
tured. Often they were decoupled from the project planning 
activity. When a project had a problem (e.g., a spacecraft would 
not fit on the launch vehicle adapter ring), the mission assur- 
ance people were involved to help solve it. Often problems 
were caused by poorly communicated overall mission needs, a 
limited data base available to the project, tight funding, and a 
limited launch window. These factors resulted in much risk that 
was not recognized until it happened. The rule-based manage- 
ment method used by NASA recognized risk as a consequence 
and classified four types of payloads: A, B, C, and D. These 
were characterized as high priority, minimum risk; high prior- 
ity, medium risk; medium priority, medium-high risk; and high 
risk, minimum cost. Guidelines for system safety, reliability, 
maintainability and quality assurance (SRM&QA) project 
requirements for class A-D payloads were also spelled out. An 
example is the treatment of single failure points (SFP): class A,
success-critical SFP’s were not permitted; class B, success-critical
SFP’s were allowed without a waiver but were minimized; class C,
success-critical SFP’s were allowed without a formal waiver; class D,
the same as class C.

Often risk came as a consequence of the mission. In an 
attempt to minimize risk, extensive tests and analyses were 
conducted. The residual risk was a consequence of deficiencies 
in the tradable resources of mass, power, cost, performance, 
and schedule. NASA tried to allocate resources, develop the 
system, verify and validate risk, launch the system, and accom- 
plish the mission with minimal risk. Using these methods 
resulted in a few failures. 

Figure 1-1. Distribution of reliability emphasis with respect to calendar years (updated from ref. 6; original figure prepared by Kam L. Wong).


Various reliability efforts were grouped into categories: 
manufacturing control, design control, reliability methods, 
failure-cause detection, finished item reliability, flaw control, 
and risk management. Figure 1-1 illustrates how these catego- 
ries have been emphasized through the years. The construction 
of figure 1-1 is approximate because its purpose is to identify 
activities, not to classify efforts precisely. Note that specific 
mission assurance activities are changing and that the amount 
of effort expended in these may not be proportional to the 
emphasis given them. A good parts management program is 
always important. The decrease in the use of reliability methods 
does not mean that parts management is unimportant; it only 
reflects that the importance of parts management has been well 
established and that parts management has become a standard 
design control task as part of a project. 

Risk Management in the Revised NASA 

The new NASA handbook on the Management of Major 
Systems and Programs is divided according to the four parts of 
the program life cycle: formulation, approval, implementation, 
and evaluation. It stresses risk management as an integral part 
of project management. The Formulation section defines a risk
management-risk assessment process and requires that all
projects use it. All risks must be dispositioned before flight.

The definition of risk management (ref. 1-2) is “An orga- 
nized, systematic decision-making process that efficiently iden- 
tifies risks, assesses or analyzes risks, and effectively reduces 
or eliminates risks to achieving the program goals.” It also 
explains that effective project management depends on a thor- 
ough understanding of the concept of risk, the principles of risk 
management, and the establishment of a disciplined risk man- 
agement process, which is shown in figure 1-2. The figure also 
explains the risk management plan requirements. A completed 
risk management plan is required at the end of the formulation 
phase and must include risk management responsibilities: 
resources, schedules, and milestones; methodologies: processes 
and tools to be used for risk identification, risk analysis, 
assessment, and mitigation; criteria for categorizing or ranking 
risks according to probability and consequences; the role of 
decisionmaking, formal reviews, and status reporting with 
respect to risk management; and documentation requirements 
for risk management products and actions. 

A new direction for mission assurance engineers should be to
provide dynamic, synthesizing feedback to those responsible
for design, manufacturing, and mission operations. The feedback
should take the form of identifying and ranking risk,
determining risk mechanisms, and explaining risk management
techniques. Mission assurance and the project should
work together to achieve mission success.

Figure 1-2. Risk management process (ref. 3).

The Challenge of NASA’s Brave New 
World 

NASA and many other Government agencies have been 
forced to face a new workplace environment. With the NASA 
budget shrinking, the nature of projects has changed: many are 
fast track and have fixed prices, which means that they must be 
completed in a better, faster, and cheaper manner. The dollars 
once put into facilities are very limited; the spacecraft budgets 
are smaller so the development cycle time has been reduced to 
save money. NASA’s solution to these constraints is to empha- 
size proactive risk management processes. The paradigm has to 
change from rule-based to knowledge-based decisions and new 
methods that will improve productivity. Figure 1-3 shows the 
total NASA Earth and Space Science project budgets that 
reflect the slogan “better, faster, and cheaper.” 


Risk as a Resource 

NASA’s new paradigm (ref. 1-3) requires that risks be 
identified and traded as a resource with an appropriate level of 
mitigation. The tradable resources have increased by one: risk, 
mass, power, schedule, performance, and cost. The resources 
are hardware allocated during development, and at the same 
time risks are addressed and traded off. When the adequacy is 
demonstrated, the spacecraft is launched, and the flight perfor- 
mance is accomplished with a recognized risk. As seen for rule- 
based activities, there may be some failures but there will be 
more spacecraft launches to learn from. Thus, the risk has been 
used as a resource process. The goal is to optimize the overall 
risk posture by accepting risk in one area to benefit another. A 
strategy to recover from the occurrence of risk must also be 
considered. Risk trades will be made (best incremental return), 
possible risk consequences evaluated and developed, and deci- 
sion or recovery options accepted and tracked. How is the cost 
of risk reduced? Here it is important to consider its marginal 
cost. When the cost per “unit of risk reduction” in a given 
component or subsystem increases significantly — stop. It would 
be better to buy down risk somewhere else. 
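
The marginal-cost rule described above can be sketched as a simple greedy allocation. This is only an illustration under stated assumptions: the mitigation steps, their costs, their risk-reduction values, and the function name `buy_down_risk` are all hypothetical and are not taken from NASA data or procedures.

```python
def buy_down_risk(options, budget):
    """Fund mitigation steps in order of lowest cost per unit of risk reduction.

    options: list of (name, cost, risk_reduction) tuples in hypothetical units.
    budget:  total funds available for mitigation.
    Returns the names of the funded steps and the unspent budget.
    """
    # Rank by marginal cost (cost per unit of risk reduced), cheapest first.
    ranked = sorted(options, key=lambda o: o[1] / o[2])
    funded = []
    for name, cost, _reduction in ranked:
        if cost <= budget:  # skip any step that no longer fits the budget
            funded.append(name)
            budget -= cost
    return funded, budget


# Hypothetical mitigation steps: (name, cost, units of risk reduced).
steps = [
    ("extra thermal test", 40, 8),      # 5 cost units per unit of risk
    ("upgrade EEE part class", 90, 6),  # 15 cost units per unit of risk
    ("added software review", 30, 10),  # 3 cost units per unit of risk
]
funded, left = buy_down_risk(steps, budget=80)
print(funded, left)
```

The expensive part upgrade is left unfunded because its cost per unit of risk reduction is much higher than that of the other steps, which is the "stop and buy down risk somewhere else" idea in miniature.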

Figure 1-3. Total NASA Earth and space science projects completed in better, faster, and cheaper environment (ref. 3).

Dr. Greenfield, the Deputy Associate Administrator in the
Office of Safety and Mission Assurance at NASA Headquarters,
gave a risk management presentation and illustrated
through six examples how to use risk as a resource (ref. 1-4).
One of his examples dealt with the class of electrical, electronic,
and electromechanical (EEE) parts (ref. 1-5). Figure 1-4
shows the function, risk trade, possible risk consequence, and
advantages for the class of parts to be used in a spacecraft. The
risk trade that a project needs to make is the type of parts to use:
class S, grade 1, class B, or commercial off-the-shelf (COTS)
parts. Each has possible risk consequences. For example, class
S, grade 1 parts have poor availability and are usually older
technology, which means higher mass and volume. The advantages
are that they are low risk, fit long-life missions, and are
more resistant to single-event upset (SEU).

A measure of risk exists for a project that chooses to use a
new technology, and it is now termed the technology infusion
risk. The technology readiness level (TRL) scale ranges from
1 to 9. A TRL of 9 is used for existing, well-established, proven
(very low-risk) technology. A TRL of 1 is used for unproven,
very high-risk technology at the basic research stage. New
technology can save time and money so there is a critical point
at which it should be put to use. The diagram of figure 1-5,
called a risk surface (notional), shows areas of high to low risk
for the various risk elements. Along the EEE parts line,
commercial off-the-shelf (COTS) parts have more risk than
class B parts, and class B parts have more risk than class S parts.
Other risk elements are also shown in this figure.

The Role of Safety and Mission Assurance (SMA) in Risk Management

NASA’s Safety and Mission Assurance (SMA) Office has 
the core competencies to serve as a risk management consultant 
to the projects and is supporting the risk management plan 
development. It provides projects with risk-resource tradeoffs: 
strategies, consequences, benefits, and mitigation approaches. 
Its role is to interact in all phases of the project decision process 
(planning, design, development, and operations). It provides 
projects with residual risk assessment during the project life 
cycle. Figure 1-6 shows the mission failure modes that cause
risk and some of the methods used to manage them so that
mission success can be achieved. The SMA role in risk management
is presented in table 1-1, which shows the SMA area and
other typical areas involved in project tradeoffs. For example,
with EEE parts, 16 tradeoff areas are identified to help the
project understand parts management risks. SMA must take the
lead to answer some very important questions: Where are the
problems? What has been done about them? Have all the risks
been mitigated? Are we ready to fly?

Figure 1-6. Some mission failure modes and methods leading to mission success (ref. 3).

TABLE 1-1. SAFETY AND MISSION ASSURANCE (SMA) ROLE IN RISK MANAGEMENT

SMA area                     Typical areas involved in tradeoffs

Quality assurance            Documentation, surveillance, inspection, certification,
                             audit, materials review board

Configuration control        Drawings, equipment lists, delivery schedules, approval
                             authority, freeze control, as-built documentation

Environmental requirements   Design and test requirements, documentation, approvals,
                             functional and environment tests, programmatics
                             (component, subsystem, system), analysis

EEE parts                    Parts lists, parts class, policy, nonstandard parts,
                             traceability, derating, failure analysis, burn-in,
                             selection, acquisition, upgrades, lot control, screening,
                             destructive physical analysis, vendor control

Reliability                  Single-failure-point policy, problem and failure reporting
                             and disposition, design performance analysis (failure
                             modes and effects criticality analysis, fault tree
                             analysis, part stress, redundancy switching, worst case,
                             single-event upset), reviews, redundancy

Systems safety               Documentation, hazard identification and/or impact,
                             analysis (fault tree analysis, hazard, failure modes and
                             effects criticality analysis, sneak circuit), structures
                             and materials reviews, electrostatic discharge (ESD)
                             control, tests, inspections, surveys

Software product assurance   Initiation, problem and failure reporting and disposition,
                             simulations, independent verification and validation
                             (IVV), tests

Reliability Training 1

1. Which NASA Policy Guide explains risk management?

A. 8701-draft I B. 7120.5A C. 2820.1 

2. What challenge is NASA facing? 

A. The NASA budget is shrinking. 

B. Many projects are being done faster, cheaper, and better. 

C. Dollars are very limited for facilities. 

D. All of the above. 

3. What are the tradeable resources that projects can use? 

A. Performance, cost, and schedule 

B. Mass, power, performance, cost, and schedule 

C. Risk, mass, power, performance, cost, and schedule 

4. How should the projects use the Safety and Mission Assurance Office? 

A. Design consultants 

B. Systems consultants 

C. Risk management consultants 


1 Answers are given at the end of this manual. 

Chapter 2

Reliability Mathematics and Failure Physics 


Mathematics Review 

Readers should have a good working knowledge of algebra 
and a familiarity with integral and differential calculus. How- 
ever, for those who feel rusty, the following review includes 
solved examples for every mathematical manipulation used in 
this manual. 

Notation 

The Greek symbol $\Sigma$ (sigma) means “take the sum of,” and
the notation

$$\sum_{i=1}^{n} x_i$$

means to take the sum of the $x_i$’s from $i = 1$ to $i = n$.

The symbol $\sqrt[n]{x}$ means “take the nth root of x.” The square
root is usually written as $\sqrt{x}$ without the index (the 2).

The Greek symbol $\Pi$ (pi) means “take the product of,”
and the notation

$$\prod_{i=1}^{n} x_i$$

means to take the product of the $x_i$’s from $i = 1$ to $i = n$.

The notation x! is referred to as a factorial and is a shorthand
method of writing 1 × 2 × 3 × 4 × 5 × 6 × . . . × x or in general
as x! = x(x - 1)(x - 2) . . . (1). However, 0! is defined as unity.

Manipulation of Exponential Functions

An exponential function is the Napierian base of the natural
logarithms, e = 2.71828 . . ., raised to some power. For
example, $e^2$ is an exponential function and has the value 7.3891.
This value can be calculated on most calculators.

Rules that must be followed when manipulating these functions
are given next.
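
For reference, the standard laws of exponents that govern the manipulation of exponential functions are summarized below. These are the general textbook identities; their order and numbering here are an assumption and do not necessarily match the rules as numbered in this manual.

```latex
\begin{align*}
e^{x}\, e^{y} &= e^{x+y} \\
\frac{e^{x}}{e^{y}} &= e^{x-y} \\
e^{-x} &= \frac{1}{e^{x}} \\
\left(e^{x}\right)^{y} &= e^{xy}
\end{align*}
```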


Rounding Data 

Reliability calculations are made by using failure rate data. 
If the failure rate data base is accurate to three places, calcula- 
tions using these data can be made to three places. Use should 
be made of the commonly accepted rule (computer’s rule) to 
round the computational results to the proper number of significant
figures. The Mathematics Dictionary (ref. 2-1) defines
rounding off: 

When the first digit dropped is less than 5, 
the preceding digit is not changed; when the 
first digit dropped is greater than 5 or 5 and 
some succeeding digit is not zero, the 
preceding digit is increased by 1; when the 
first digit dropped is 5 and all succeeding 
digits are zero, the commonly accepted rule 
is to make the preceding digit even, i.e., add 
1 to it if it is odd, and leave it alone if it is 
already even. 

For example, if the reliability of a system is 0.8324, 0.8316, or
0.8315, it would take the form 0.832 if rounded off to three
places.
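
The rounding rule quoted above (ties go to the even digit) can be sketched with Python's standard `decimal` module. The helper name `round_half_even` is an illustrative choice, and the values are passed as strings on the assumption that binary floating-point representation error should be kept out of the tie test.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value: str, places: int) -> Decimal:
    """Round a decimal string to `places` decimal places, ties going to the even digit."""
    quantum = Decimal(10) ** -places  # e.g. places=3 gives Decimal("0.001")
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)


# The three reliability values from the text, rounded to three places.
for r in ("0.8324", "0.8316", "0.8315"):
    print(r, "->", round_half_even(r, 3))
```

A value such as 0.8325 would stay at 0.832 under this rule, because the preceding digit (2) is already even.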


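Integration Formulas

Only the following integration formulas are used in this
manual:

$$\int_a^b x^n \, dx = \left. \frac{x^{n+1}}{n+1} \right|_a^b = \frac{b^{n+1} - a^{n+1}}{n+1} \qquad (1)$$

$$\int_p^q e^{-x} \, dx = \left. -e^{-x} \right|_p^q = e^{-p} - e^{-q} \qquad (2)$$

$$\int_p^q e^{-ax} \, dx = \left. -\frac{e^{-ax}}{a} \right|_p^q = \frac{e^{-ap} - e^{-aq}}{a} \qquad (3)$$

Example 1:

$$\int_2^3 x \, dx = \left. \frac{x^{1+1}}{1+1} \right|_2^3 = \frac{(3)^2 - (2)^2}{2} = \frac{9 - 4}{2} = \frac{5}{2}$$

Example 2:

$$\int_3^4 e^{-x} \, dx = \left. -e^{-x} \right|_3^4 = e^{-3} - e^{-4}$$

Example 3:

$$\int_3^4 e^{-2x} \, dx = \left. -\frac{e^{-2x}}{2} \right|_3^4 = \frac{e^{-6} - e^{-8}}{2}$$

Differential Formulas

Only the following differential formulas are used in this
manual:

$$\frac{d(ax)}{dx} = a \qquad (4)$$

$$\frac{d(ax^n)}{dx} = nax^{n-1} \qquad (5)$$

Example 4:

$$\frac{d(4x)}{dx} = 4$$

Example 5:

$$\frac{d(x^2)}{dx} = 2x^{2-1} = 2x$$

$$\frac{d(4x^3)}{dx} = (3)4x^{3-1} = 12x^2$$

Partial Derivatives

This manual uses the following partial derivative formula:

$$\frac{\partial v}{\partial x} = \frac{\partial (xyz)}{\partial x} = yz \qquad (6)$$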
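
The closed-form integration results for powers of x and for decaying exponentials can be spot-checked numerically. The sketch below compares each closed form with a midpoint-rule approximation; the function names and the step count are illustrative choices, not part of the manual.

```python
import math


def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the definite integral of f from a to b with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def power_integral(n, a, b):
    """Closed form of the integral of x**n from a to b (n != -1)."""
    return (b ** (n + 1) - a ** (n + 1)) / (n + 1)


def exp_integral(k, p, q):
    """Closed form of the integral of exp(-k*x) from p to q (k != 0)."""
    return (math.exp(-k * p) - math.exp(-k * q)) / k


# Each line prints the closed-form value followed by the numerical estimate.
print(power_integral(2, 2, 3), midpoint_integral(lambda x: x ** 2, 2, 3))
print(exp_integral(1, 3, 4), midpoint_integral(lambda x: math.exp(-x), 3, 4))
print(exp_integral(2, 3, 4), midpoint_integral(lambda x: math.exp(-2 * x), 3, 4))
```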


TABLE 2-1. BINOMIAL COEFFICIENTS

        Coefficient of each term of (a + b)^n
  n     1    2    3    4    5    6    7    8    9   10   11
  1     1    1
  2     1    2    1
  3     1    3    3    1
  4     1    4    6    4    1
  5     1    5   10   10    5    1
  6     1    6   15   20   15    6    1
  7     1    7   21   35   35   21    7    1
  8     1    8   28   56   70   56   28    8    1
  9     1    9   36   84  126  126   84   36    9    1
 10     1   10   45  120  210  252  210  120   45   10    1

Example 6:

$$v = xyz = 2\ \text{ft} \times 3\ \text{ft} \times 4\ \text{ft} = 24\ \text{ft}^3$$

where x = 2 ft, y = 3 ft, and z = 4 ft. Then

$$\frac{\partial v}{\partial x} = yz = 12\ \text{ft}^2$$


Expansion of (a + b)^n

It will be necessary to know how to transform the
expression (a + b)^n into a binomial expansion. This type of
problem is easily solved by using table 2-1 and recalling that

$$(a + b)^n = a^n + na^{n-1}b + \frac{n(n-1)}{2!}a^{n-2}b^2 + \frac{n(n-1)(n-2)}{3!}a^{n-3}b^3 + \cdots + \frac{n(n-1)(n-2)\cdots(n-m+1)}{m!}a^{n-m}b^m + \cdots + b^n \qquad (7)$$

Example 7:

Expand (a + b)^4. From table 2-1 with n = 4,

$$(a + b)^4 = a^4 + 4a^3b + 6a^2b^2 + 4ab^3 + b^4$$

Failure Physics

When we consider reliability, we think of all the parts or
components of a system continuing to operate correctly. Therefore
a reliable system or product must have reliable parts. But
what makes a part reliable? When asked, many people would
say a reliable part is one purchased according to a certain source
control document and bought from an approved vendor. Unfortunately,
these two qualifications are not always guarantees
of reliability. The following case illustrates this problem.

A clock purchased according to PD 4600008, procured from
an approved vendor for use in the ground support equipment
of a missile system, was subjected to qualification tests as part
of the reliability program. These tests consisted of high- and
low-temperature, mechanical shock, temperature shock, vibration,
and humidity. The clocks from the then sole-source vendor
failed two of the tests: low temperature and humidity. A
failure analysis revealed that lubricants in the clock’s mechanism
froze and that the seals were not adequate to protect the
mechanism from humidity. A second approved vendor was
selected. His clocks failed the high-temperature test. In the
process, the dial hands and numerals turned black, making readings
impossible from a distance of 2 ft. A third approved
vendor’s clocks passed all the tests except mechanical shock,
which cracked two of the cases. Ironically, the fourth approved
vendor’s clocks, though less expensive, passed all the tests.

TABLE 2-2. RESULTS OF QUALIFICATION TESTS ON SOURCE CONTROL DOCUMENT CLOCK

Vendor  High         Low          Mechanical  Temperature  Vibration  Humidity
        temperature  temperature  shock       shock
1                    Fail                                             Fail
2       Fail
3                                 Fail
4

The point of this illustration is that four clocks, each designed
to the same specification and procured from a qualified
vendor, all performed differently in the same environments.
Why did this happen? The specification did not include the
gear lubricant or the type of coating on the hands and numerals
or the type of case material.

Many similar examples could be cited, ranging from requirements
for glue and paint to complete assemblies and systems.
The key to solving these problems is best stated as follows: To
know how reliable a product is or how to design a reliable
product, you must know all the ways its parts could fail and
the types and magnitude of stresses that cause such failures.
Think about this: if you knew every conceivable way a missile
could fail and if you knew the type and level of stress required
to produce each failure, you could build a missile that would
never fail because you could eliminate

(1) As many types of failure as possible

(2) As many stresses as possible

(3) The remaining potential failures by controlling the
level of the remaining stresses

Sound simple? Well, it would be except that despite the
thousands of failures observed in industry each day, we still
know very little about why things fail and even less about how
to control the failures. However, through systematic data
accumulation and study, we continue to learn more.

As stated, this manual introduces some basic concepts of
failure physics: failure modes (how failures are revealed); failure
mechanisms (what produces the failure mode); and failure
stresses (what activates the failure mechanisms). The theory
of and the practical tools for controlling failures are also
presented.

Probability Theory 

Fundamentals 

Because reliability values are probabilities, every student of 
reliability disciplines should know the fundamentals of prob- 
ability theory, which is used in chapter 3 to develop models 
that represent how failures occur in products. 

Probability defined. Probability can be defined as follows:
If an event can occur in A different ways, all of which are con- 
sidered equally likely, and if a certain number B of these events 
are considered successful or favorable, the ratio B/A is called 
the probability of the event. A probability, according to this 
definition, is also called an a priori (beforehand) probability 
because its value is determined without experimentation. It fol- 
lows that reliability predictions of the success of missile flights 
that are made before the flights occur are a priori reliabilities. 
In other words, a priori reliabilities are estimates of what may 
happen and are not observed facts. 

After an experiment has been conducted, an a posteriori prob- 
ability, or an observed reliability, can be defined as follows: If 
f(n) is the number of favorable or successful events observed 
in a total number of n trials or attempts, the relative frequency 
f(n)/n is called the statistical probability, the a posteriori
probability, the empirical probability, or the observed
reliability. Note that the number of favorable events f(n) is a
function of the total number of trials or attempts n. Therefore, 
as the number of trials or attempts changes, f(n) may also 
change, and consequently the statistical probability (or 
observed reliability) may change. 

Reliability of a coin. To apply this theory, consider the
physics of a coin. Assume that it has two sides, is thin, and is
made of homogeneous material. If the coin is tossed, one of
two possible landings may occur: with the head side up or tail
side up. If landing heads up is considered more favorable than
landing tails up, a prediction of success can be made by using
the a priori theory. From the a priori definition, the probability
of success is calculated as

$$\frac{1\ \text{favorable event}}{2\ \text{possible events}} = \frac{1}{2}, \text{ or 50 percent}$$

This is an estimate of what should be observed if the coin is
tossed but is not yet an observed fact. After the coin is tossed,
however, the probability of success could be much more specific
as shown in table 2-3.

TABLE 2-3. OBSERVED PROBABILITY OF SUCCESS

Number of tosses, n                                      1     10    100   1000   10 000
Number of heads observed, f(n)                           0      7     55    464     5080
Relative frequency or probability of success, f(n)/n     0   0.70   0.55  0.464    0.508
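
The relative frequencies in table 2-3 can be recomputed directly from the toss and head counts. The sketch below uses the counts from the table; the variable and function names are illustrative.

```python
# Counts taken from table 2-3: (number of tosses n, number of heads f(n)).
observations = [(1, 0), (10, 7), (100, 55), (1000, 464), (10_000, 5080)]

a_priori = 1 / 2  # one favorable outcome out of two equally likely outcomes


def observed_reliability(successes, trials):
    """Relative frequency f(n)/n of successes in n trials."""
    return successes / trials


for n, heads in observations:
    r = observed_reliability(heads, n)
    print(f"n={n:>6}  f(n)/n={r:.3f}  difference from a priori={abs(r - a_priori):.3f}")
```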

The table shows two important phenomena: 

(1) As the number of trials changes, the number of favorable
events observed also changes. An observed probability of success
(or observed reliability) may also change with each additional
trial.

(2) If the assumptions made in calculating the a priori probability
(reliability prediction) are correct, the a posteriori
(observed) probability will approach the predicted probability
as the number of trials increases. Mathematically, the relative
frequency f(n)/n approaches the a priori probability B/A as the
number of trials n increases, or

$$\lim_{n \to \infty} \frac{f(n)}{n} = \frac{B}{A}$$

In the coin toss example, the predicted reliability was 0.50.
The observed reliability of 0.508 indicates that the initial
assumptions about the physics of the coin were probably correct.
If, as a result of 10 000 tosses, heads turned up 90 percent
of the time, this could indicate that the coin was incorrectly
assumed to be homogeneous and that, in fact, it was “loaded.”
Inconsistency in the actual act of tossing the coin, a variable 
that was not considered in the initial assumptions, could also 
be indicated. Here again, even with a simple coin problem, it 
is necessary to consider all the ways the coin may “fail” in 
order to predict confidently how it will perform. 
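
The convergence of the observed frequency toward the a priori value can be illustrated with a simulated coin. This is a sketch under stated assumptions: the simulation uses Python's pseudorandom generator with a fixed seed, and the function name and the 0.9 bias of the "loaded" coin are illustrative.

```python
import random


def simulate_heads_frequency(n_tosses, p_heads=0.5, seed=0):
    """Toss a simulated coin n_tosses times and return the relative frequency of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < p_heads for _ in range(n_tosses))
    return heads / n_tosses


# A fair coin: the frequency should settle near 0.5 as n grows.
for n in (10, 100, 1000, 10_000, 100_000):
    print(n, simulate_heads_frequency(n))

# A "loaded" coin shows up as a frequency that settles away from 0.5.
print(simulate_heads_frequency(100_000, p_heads=0.9))
```

If the long-run frequency disagrees with the prediction, either the assumed physics of the coin or the tossing procedure is the first thing to question.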

Reliability of missiles. In the aerospace industry, a priori
probabilities (reliability predictions) are calculated for missiles
in an effort to estimate the probability of flight success. Inherent
in the estimate are many assumptions based on the physics
of the missile, such as the number of its critical parts, 
its response to environments, and its trajectory. As in the coin 
problem, the ultimate test of the missile’s reliability prediction 
is whether or not the prediction agrees with later observations. 

If during flight tests, the observations do not approach the 
predictions as the number of flights increases, the initial 
assumptions must be evaluated and corrected. An alternative 
approach is to modify the missile to match the initial assump- 
tions. This approach is usually pursued when the reliability 
prediction represents a level of success stated by the customer
or when the predicted value is mandatory for the missile to be 
effective. This subject of reliability predictions is discussed 
again in chapter 4. 

In practice, reliability testing yields the knowledge needed 
to verify and improve initial assumptions. As experience is 
gained, the assumptions undergo refinements that make it pos- 
sible to develop more accurate reliability predictions on new 
missiles and systems not yet tested or operated. This informa- 
tion also provides design engineers and management with data 
to guide design decisions toward maximum missile or system 
reliability. Some reliability problems require the use of Bayes 
or Markovian probability theorems. Additional information on 
other topics is available in references 2-2 to 2-5 and in IEEE 
Reliability Society publications and other documents listed in 
the reference sections for chapters 3 to 9 and in the bibliogra- 
phy at the end of this manual. 


Probability Theorems

The three probability theorems presented here are fundamental
and easy to understand. In these theorems and examples,
the probability of success (reliability) is represented with an $R$
and the probability of failure (unreliability) with a $Q$. The
following section (Concept of Reliability) examines what
contributes to the reliability and unreliability of products.

Theorem 1: If the probability of success is $R$, the probability
of failure $Q$ is equal to $1 - R$. In other words, success and
failure together account for all possible outcomes, so $Q + R = 1$.

Example 1: If the probability of a missile flight success
is 0.81, the probability of flight failure is 1 - 0.81 = 0.19.
Therefore, the probability that the flight will succeed or fail is
0.19 + 0.81 = 1.0.

Theorem 2: If $R_1$ is the probability that a first event will
occur and $R_2$ is the probability that a second independent event
will occur, the probability that both events will occur is $R_1 R_2$.
A similar statement can be made for more than two independent
events.

Example 2: If the probability of completing one countdown
without a failure $R_1$ is 0.9, the probability of completing two
countdowns without failure is $R_1 R_2 = (0.9)(0.9) = 0.81$. The
probability that at least one of the two countdowns will fail is
$1 - R_1 R_2 = 1 - 0.81 = 0.19$ (from theorem 1). We say that at
least one will fail because the unreliability term $Q$ includes
all possible failure modes, of which there are two in this case:
one or both countdowns fail.

Example 3: If the probability of failure $Q_1$ during one countdown
is 0.1, the probability of failure during both of two countdowns
is $Q_1 Q_2 = (0.1)(0.1) = 0.01$. Therefore, the probability that at
least one countdown will succeed is $1 - Q_1 Q_2 = 1 - 0.01 = 0.99$.
We say that at least one will succeed because the value 0.99
includes the probability of one countdown succeeding and the
probability of both countdowns succeeding.

Example 4: If the probability of completing one countdown
without failure $R_1$ is 0.9 and the probability of a second
countdown failing is $Q_2 = 0.1$, the probability that the first will
succeed and the second will fail is $R_1 Q_2 = (0.9)(0.1) = 0.09$.

Theorem 3: If the probability that one event will occur is
$R_1$ and the probability that a second event will occur is $R_2$ and
if not more than one of the events can occur (i.e., the events are
mutually exclusive), the probability that either the first or
second event, not both, will occur is $R_1 + R_2$. A similar theorem
can be stated for more than two events.

Example 5 (true event method): Consider now the probability
of completing two countdowns without a failure. Let the
probabilities of success for the first and second countdowns be
$R_1$ and $R_2$ and the probabilities of failure be $Q_1$ and $Q_2$. To
solve the problem using theorem 3, it is best to diagram the
possible events as shown in figure 2-1. The mutually exclusive
events are

$Q_1$       first countdown fails
$R_1 Q_2$   first countdown succeeds and second fails
$R_1 R_2$   both countdowns succeed

[Figure 2-1. Diagram of possible events: probability of completing
two countdowns without a failure. The first countdown either fails
($Q_1$) or succeeds ($R_1$); if it succeeds, the second countdown
either succeeds ($R_1 R_2$) or fails ($R_1 Q_2$).]

From theorem 3, the probability that one of the three events
will occur is

$$Q_1 + R_1 Q_2 + R_1 R_2$$

But because these three events represent all possible events
that can occur, their sum equals 1 (from theorem 1). Therefore,

$$Q_1 + R_1 Q_2 + R_1 R_2 = 1$$

The probability of completing both countdowns without a
failure, $R_1 R_2$, is the solution to the proposed problem; therefore,

$$R_1 R_2 = 1 - (R_1 Q_2 + Q_1)$$

If $R_1 = 0.9$, $Q_1 = 0.1$, $R_2 = 0.9$, and $Q_2 = 0.1$, then

$$R_1 R_2 = 1 - [(0.9)(0.1) + 0.1] = 1 - (0.09 + 0.1) = 1 - 0.19 = 0.81$$

which agrees with the answer found in example 2 by using
theorem 2. The expression for $R_1 R_2$ can also be written

$$R_1 R_2 = 1 - (R_1 Q_2 + Q_1) = 1 - [(1 - Q_1) Q_2 + Q_1] = 1 - (Q_1 + Q_2 - Q_1 Q_2)$$

which is the usual form given for the probability of both events
succeeding. However, note that in this expression, the event
indicated by $Q_1 Q_2$ (both countdowns fail) is not a true possible
event because we stipulated in the problem that only one
countdown could fail. The term $Q_1 Q_2$ is only a mathematical
event with no relation to observable events. In other words, if
the first countdown fails, we have lost our game with chance.

Example 6 (mathematical event method): Now consider the
problem of example 5, ignoring for the time being the restriction
on the number of failures allowed. In this case, the diagram
of the possible events looks like that shown in figure 2-2, and
the mutually exclusive events are

$R_1 R_2$   both countdowns succeed
$R_1 Q_2$   first countdown succeeds and second fails
$Q_1 R_2$   first countdown fails and second succeeds
$Q_1 Q_2$   both countdowns fail

Keep in mind that in this example both countdowns may fail.
From theorem 3, the probability that one of the four events
will occur is

$$R_1 R_2 + R_1 Q_2 + Q_1 R_2 + Q_1 Q_2$$

Again, because the four events represent all possible events that
can occur, their sum equals unity (from theorem 1); that is,

$$R_1 R_2 + R_1 Q_2 + Q_1 R_2 + Q_1 Q_2 = 1$$

Solving for the probability that both countdowns will succeed gives

$$R_1 R_2 = 1 - (R_1 Q_2 + Q_1 R_2 + Q_1 Q_2)$$

Substituting $1 - Q_1$ for $R_1$ and $1 - Q_2$ for $R_2$ on the right side of
the equation yields the answer given in example 5:

$$R_1 R_2 = 1 - [(1 - Q_1) Q_2 + Q_1 (1 - Q_2) + Q_1 Q_2]$$

$$= 1 - (Q_2 - Q_1 Q_2 + Q_1 - Q_1 Q_2 + Q_1 Q_2)$$

$$= 1 - (Q_1 + Q_2 - Q_1 Q_2)$$

This countdown problem has been solved in two ways to 
acquaint you with both methods of determining probability dia- 
grams, the true event and the mathematical event. The exer- 
cises at the end of this chapter may be solved by using the 
method you prefer. We suggest that you work the problems 
before continuing to the next section because they help you to 
gain a working knowledge of the three theorems presented. 
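
The two methods can also be checked numerically. The following is a
minimal sketch, not part of the original manual: the function name
outcome_probabilities and the variable names are illustrative
assumptions, and the two countdowns are assumed to be independent, as
in theorem 2.

```python
from itertools import product


def outcome_probabilities(r1, r2):
    """Return the probability of each (first, second) countdown outcome.

    True means the countdown succeeds. Independence is assumed (theorem 2).
    """
    q1, q2 = 1.0 - r1, 1.0 - r2  # theorem 1
    p_first = {True: r1, False: q1}
    p_second = {True: r2, False: q2}
    return {
        (a, b): p_first[a] * p_second[b]
        for a, b in product((True, False), repeat=2)
    }


probs = outcome_probabilities(0.9, 0.9)

# Mathematical event method (example 6): four mutually exclusive events.
both_succeed = 1.0 - (
    probs[(True, False)] + probs[(False, True)] + probs[(False, False)]
)

# True event method (example 5): a failed first countdown is one event, Q1,
# whatever the second countdown would have done.
q1 = probs[(False, True)] + probs[(False, False)]
both_succeed_true_event = 1.0 - (probs[(True, False)] + q1)

print(round(both_succeed, 4), round(both_succeed_true_event, 4))  # 0.81 0.81
```

Both routes give 0.81 because $Q_1 = Q_1 R_2 + Q_1 Q_2$; the true event
method simply declines to split the first-countdown failure into two
branches.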

Concept of Reliability 

Now that you understand the concepts of probability and 
failure physics, you are ready to consider the concept of reli- 
ability. First, we will discuss the most common definition of 
reliability — in terms of the successful operation of a device. 
This definition, to fit the general theme of the manual, is then 
modified to consider reliability in terms of the absence of fail- 
ure modes. 


[Figure 2-2. Diagram of possible events: number of failures not
restricted. The four branches are $R_1 R_2$, $R_1 Q_2$, $Q_1 R_2$,
and $Q_1 Q_2$.]


Reliability as Probability of Success 

The classical definition of reliability is generally expressed 
as follows: Reliability is the probability that a device will oper- 
ate successfully for a specified period of time and under speci- 
fied conditions when used in the manner and for the purpose 
intended. This definition has many implications. The first is 
that when we say that reliability is a probability, we mean that 
reliability is a variable, not an absolute value. Therefore, if a 
device is 90 percent reliable, there is a 10 percent chance that 
it will fail. And because the failure is a chance, it may or may 
not occur. As in the coin example, as more and more of the 
devices are tested or operated, the ratio of total success to 
total attempts should approach the stated reliability of 90 percent.
The next implication concerns the statement “. . . will
operate successfully . . .” This means that failures that keep the
device from performing its intended mission will not occur.
From this comes a more general definition of reliability: it is
the probability of success.

It should be obvious then that a definition of what consti- 
tutes the success of a device or a system is necessary before a 
statement of its reliability is possible. One definition of suc- 
cess for a missile flight might be that the missile leaves the 
launching pad; another, that the missile hits the target. Either 
way, a probability of success, or reliability, can be determined, 
but it will not be the same for each definition of success. The 
importance of defining success cannot be overemphasized. 
Without it, a contractor and a customer will never reach an 
agreement on whether or not a device has met its reliability 
requirements (i.e., the mission). 

The latter part of the classical definition indicates that a defi- 
nition of success must specify the operating time, the operating 
conditions, and the intended use. Operating time is defined as 
the time period in which the device is expected to meet its 
reliability requirements. The time period may be expressed in 
seconds, minutes, hours, years, or any other unit of time. Op- 
erating conditions are defined as the environment in which the 
device is expected to operate; they specify the electrical, 
mechanical, and environmental levels of operation and their 
durations. Intended use is defined as the purpose of the device 
and the manner in which it will be used. For example, a missile
designed to hit targets 1000 miles away should not be considered
unreliable if it fails to hit targets 1100 miles away.
Similarly, a set of ground checkout equipment designed to be
90 percent reliable for a 1-hour tactical countdown should not
be considered unreliable if it fails during 10 consecutive
countdowns or training exercises. The probability of success in this
case is $(0.9)^{10} = 0.35$ (from probability theorem 2).

In addition to these specified requirements, we must also 
consider other factors. As explained in the inherent product 
reliability section of this chapter, these areas have a marked 
effect on the reliability of any device. 

Reliability as Absence of Failure 

Although the classical definition of reliability is adequate for 
most purposes, we are going to modify it somewhat and 
examine reliability from a slightly different viewpoint. Con- 
sider this definition: Reliability is the probability that the 
critical failure modes of a device will not occur during a 
specified period of time and under specified conditions when 
used in the manner and for the purpose intended. Essentially, 
this modification replaces the words “a device will operate 
successfully” with the words “critical failure modes . . . will not
occur.” This means that if all the possible failure modes of a 
device (ways the device can fail) and their probabilities of 
occurrence are known, the probability of success (or the reli- 
ability of a device) can be stated. It can be stated in terms of the 


probability that those failure modes critical to the performance 
of the device will not occur. Just as we needed a clear definition 
of success when using the classical definition, we must also 
have a clear definition of failure when using the modified 
definition. 

For example, let a system have two subsystems, A and B, 
whose states are statistically independent and whose separate 
reliabilities are known to be R A = 0.990 and R B = 0.900. The 
system fails if and only if at least one subsystem fails. The 
appropriate formula for system reliability is 

$$R_{\text{system}} = R_A R_B$$

$$R_{\text{system}} = (0.990)(0.900) = 0.891$$
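
The series calculation generalizes to any number of independent
elements. The sketch below is illustrative only: the function name
series_reliability is an assumption, not something defined in this
manual, and it relies on the elements being statistically independent
(theorem 2).

```python
from math import prod


def series_reliability(reliabilities):
    """Reliability of a system that fails if any independent element fails."""
    values = list(reliabilities)
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must lie in [0, 1], got {r}")
    return prod(values)


r_system = series_reliability([0.990, 0.900])       # subsystems A and B
r_ten_countdowns = series_reliability([0.9] * 10)   # ten 90-percent countdowns
print(round(r_system, 3), round(r_ten_countdowns, 2))  # 0.891 0.35
```

The second call reproduces the ten-countdown figure quoted under
Reliability as Probability of Success: repeating a 90-percent-reliable
operation ten times is the same calculation as a series system of ten
elements.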

Product Application 

This section relates reliability (or the probability of success) 
to product failures. 

What are the types of product failure modes? In general, 
critical equipment failures may be classified as catastrophic, 
tolerance, or wearout. The expression for reliability then be- 
comes 

$$R_D = \text{Probability}(C \times t \times W)$$

where

$R_D$   design-stage reliability of a product
$C$     event that catastrophic failure does not occur
$t$     event that tolerance failure does not occur
$W$     event that physical wearout does not occur

This is the design-stage reliability of a product as described
by its documentation. (Note that $R_i$, the inherent reliability, is a
term often used in place of $R_D$.) The documentation specifies
the product itself and states the conditions of use and opera- 
tion. This design-stage reliability is predicated on the decisions 
and actions of many people. If they change, the design-stage 
reliability could change. 

Why do we consider design-stage reliability? Because the 
facts of failure are these: When a design comes off the drawing 
board, the parts and materials have been selected; the toler- 
ance, error, stress, and other performance analyses have been 
performed; the type of packaging is firm; the manufacturing 
processes and fabrication techniques have been decided; and 
usually the test methods and the quality acceptance criteria 
have been selected. The design documentation represents some 
potential reliability that can never be increased except by a 
design or manufacturing change or good maintenance. How- 
ever, the possibility exists that the observed reliability will be 
much less than the potential reliability. 
To understand why this is true, consider the hardware as a
black box with a hole in both the top and bottom. Inside are 
potential failures that limit the design-stage reliability of the 
design. When the hardware is operated, these potential fail- 
ures fall out the bottom (i.e., operating failures are observed). 
The rate at which the failures fall out depends on how the box 
or hardware is operated. Unfortunately, we never have just the 
design-stage failures to worry about because other types of 
failures are being added to the box through the hole in the top. 
These other failures are generated by the manufacturing, soft- 
ware, quality, and logistics functions, by the user or customer, 
and even by the reliability organization itself. We discuss these 
added failures and their contributors in the following para- 
graphs, but it is important to understand that because of the 
added failures, the observed failures could be greater than 
the design-stage failures. 

K-Factors

The other contributors to product failure just mentioned are
called K-factors; they have a value between 0 and 1 and modify
the design-stage reliability:

$$R_{\text{product}} = R_D (K_q \times K_m \times K_s \times K_r \times K_l \times K_u)$$

K-factors denote probabilities that design-stage reliability will
not be degraded by

$K_q$   quality test methods and acceptance criteria
$K_m$   manufacturing, fabrication, and assembly techniques
$K_s$   software
$K_r$   reliability engineering activities
$K_l$   logistics activities
$K_u$   user or customer

Any K-factor can cause reliability to go to zero. If each K-factor
equals 1 (the goal), $R_{\text{product}} = R_D$.
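
To see how quickly K-factors erode design-stage reliability, the
following sketch multiplies them out. It is illustrative only: the
function name product_reliability and every numeric value below are
hypothetical and are not taken from this manual.

```python
from math import prod


def product_reliability(r_design, k_factors):
    """Apply K-factors (each between 0 and 1) to the design-stage reliability."""
    for name, k in k_factors.items():
        if not 0.0 <= k <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {k}")
    return r_design * prod(k_factors.values())


# Hypothetical values chosen only for illustration; not from the manual.
k_factors = {
    "K_q": 0.99,  # quality
    "K_m": 0.98,  # manufacturing
    "K_s": 1.00,  # software
    "K_r": 1.00,  # reliability engineering
    "K_l": 0.97,  # logistics
    "K_u": 0.95,  # user or customer
}
r_product = product_reliability(0.90, k_factors)
print(round(r_product, 3))
```

With every factor set to 1 the function returns the design-stage
reliability unchanged, and a single factor of 0 drives the product
reliability to zero, matching the two limiting cases stated above.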

Interface Definition and Control 

This section is a training manual describing the elements of 
interface definition and control (ref. 2-7). 

This technical manual was developed as part of the Office of 
Safety and Mission Assurance continuous training initiative. 
The structured information contained herein will enable the 
reader to efficiently and effectively identify and control the 
technical detail needed to ensure that flight system elements 
mate properly during assembly operations (on the ground and 
in space). 

Techniques used throughout the Federal Government to 
define and control technical interfaces for hardware and software
were investigated. The proportion of technical information
actually needed to effectively define and control the essential
dimensions and tolerances of system interfaces rarely
exceeded 50 percent of any interface control document. Also, 
the current government process for interface control is very 
paper intensive. Streamlining this process can improve com- 
munication, provide significant cost savings, and improve over- 
all mission safety and assurance. 

The objective of this manual is to ensure that the format, 
information, and control of interfaces between equipment are 
clear and understandable and contain only the information 
needed to guarantee interface compatibility. The emphasis is 
on controlling the engineering design of the interface and is 
not on the functional performance requirements of the system 
or on the internal workings of the interfacing equipment. In- 
terface control should take place, with rare exception, at the 
interfacing elements and not further. 

Two essential sections of the manual are Principles of Inter- 
face Control and The Process: Through the Design Phases. The 
first discusses how interfaces are defined, describes the types 
of interfaces to be considered, and recommends a format for 
the documentation necessary to adequately control the inter- 
face. The second provides tailored guidance for interface defi- 
nition and control. 

This manual can be used to improve planned or existing in- 
terface control processes during system design and develop- 
ment and also to refresh and update the corporate knowledge 
base. The information presented will reduce the amount of pa- 
per and data required in interface definition and control pro- 
cesses by as much as 50 percent and will shorten the time 
required to prepare an interface control document. It also high- 
lights the essential technical parameters that ensure that flight 
subsystems will indeed fit together and function as intended 
after assembly and checkout. Please contact the NASA Center 
for Aerospace Information, (301) 621-0390 to obtain a copy. 

Appendix A contains tables and figures that provide refer- 
ence data to support chapters 2 to 6. Appendix B is a practical 
product assurance guide for project managers. 


Concluding Remarks 

Chapter 2 explained two principal concepts: 

1. To design a reliable product or to improve a product, you
must understand first how the product can fail and then how to 
control the occurrence of the failures. 

2. There is an upper limit to a product's reliability when a 
traditional method of design and fabrication is used. This limit 
is the inherent reliability. Therefore, the most effective reli- 
ability engineer is the designer because all his decisions di- 
rectly affect the product’s reliability. 

The three probability theorems were also illustrated. 
Reliability Training*

1a. What notation means to take the sum of the $x_i$'s from $i = 1$ to $i = n$?

A. $\sum_{i=1}^{\infty} x_i$   B. $\sum_{i=1}^{n} x_i$   C.

1b. If $x = 100$, $x_1 = 90$, $x_2 = 70$, and $x_3 = 50$, what is $\sum_{i=1}^{n} (x - x_i)^2$?

A. 350   B. $35 \times 10^2$   C. 35 000

2a. What notation means to take the $n$th root of $x$?

A. $x^n$   B. $e^n$   C. $\sqrt[n]{x}$

2b. If $x = 100$, $x_1 = 90$, $x_2 = 70$, and $x_3 = 50$, what is $\sqrt{\sum_{i=1}^{n} (x - x_i)^2}$?

A. 3.6   B. 59.2   C. 640

3a. What notation means to take the product of the $x_i$'s from $i = 1$ to $n$?

A. $\prod_{i=0}^{\infty} x_i$   B. $\prod_{i=1}^{n} x_i$   C. $\sum_{i} x_i$

3b. If $x_1 = 0.9$, $x_2 = 0.99$, and $x_3 = 0.999$, what is $\prod_{i=1}^{3} x_i$?

A. 0.890   B. 0.800   C. 0.991

4a. The notation $x!$ refers to what shorthand method of writing?

A. Poles B. Factorial C. Polynomials 

4b. What does 10!/8! equal? 

A. 800 B. 900 C. 90 

5a. Describe the three rules for manipulation of exponential functions. 

i. Products 

A. Subtract exponents B. Add exponents C. Multiply exponents 

ii. Negative exponent 

A. Cancel exponents B. Balance exponents C. 1/Exponent 

iii. Division 

A. Add exponents B. Subtract exponents C. Multiply exponents 

* Answers are given at the end of this manual. 
5b. Simplify $e^6 e^3 e^{-4}$.

A. $e^2$   B. $e^4$   C. $e^5$

6. What is the integral of the following functions?

a. $\int_{x_1}^{x_2} x^3 \, dx$

A. $x^4/4$   B. $x^4/4p$   C. $[(x_2)^4 - (x_1)^4]/4$

b. $\int_{x_1}^{x_2} e^{-ax} \, dx$

A. $-e^{-ax}/a$   B. $[e^{-ax_1} - e^{-ax_2}]/a$   C. $-e^{-ax}$

7. What is the derivative of the following functions?

a. $10x^4$

A. $40x^2$   B. $40x^3$   C. $10x^3$

b. $e^{2x}$

A. $e^{2x}$   B. $e^{x^2}$   C. $2e^{2x}$

8a. Write the first two terms of the binomial expansion $(a + b)^n$.

A. $a^n + (n - 1)a^{n-1}b + \ldots$   B. $a^n - na^{n-1}b + \ldots$   C. $a^n + na^{n-1}b + \ldots$

8b. Expand $(a + b)^3$ by using table 2-1.

A. $a^3 + 2a^2b + b^3$   B. $a^3 - 3a^2b - 3ab^2 + b^3$   C. $a^3 + 3a^2b + 3ab^2 + b^3$

9. What needs to be done to design a reliable product?

A. Test and fix it

B. Know how its parts fail

C. Know the type and magnitude of stresses that cause such failures

D. Both B and C

10. What are a priori reliabilities estimates of?

A. What may happen   B. What will happen   C. What has happened

11. What are a posteriori reliabilities observing?

A. What may happen   B. What has happened   C. What will happen
12. If the probability of success is $R$, what is the probability of failure $Q$?

A. $1 + R$   B. $1 - R^2$   C. $1 - R$

13. If $R_1$, $R_2$, and $R_3$ are the probabilities that three independent events will occur, what is the
probability that all three will occur?

A. $R_1 + R_2 + R_3$   B. $R_1 (R_2 + R_3)$   C. $\prod_{i=1}^{3} R_i$

14. If $R_1$, $R_2$, and $R_3$ are the probabilities that three independent events will occur and not more than
one of the events can occur, what is the probability that one of these events will occur?

A.   B. $R_3 (R_1 + R_2)$   C. $\sum_{i=1}^{3} R_i$

15. What do we need to know if a device is to perform with classical reliability? 

A. Operating time and conditions 

B. How it will be used 

C. The intended purpose 

D. All of the above 

16. What do we need to know if a device is to perform with reliability defined as the absence of 
failure? 

A. Critical failure modes 

B. Operating time and conditions 

C. How it will be used 

D. The intended purpose 

E. All of the above 

17. What is the inherent reliability $R_i$ of the product you are working on?

A. $P_c$ (the probability that catastrophic part failures will not occur)

B. $P_t$ (the probability that tolerance failures will not occur)

C. $P_w$ (the probability that wearout failures will not occur)

D. The product of all the above

18. What is the reliability of your product?

A. $K_q$ (the probability that quality test methods will not degrade $R_i$)

B. $K_m$ (the probability that manufacturing processes will not degrade $R_i$)

C. $K_r$ (the probability that reliability activities will not degrade $R_i$)

D. $K_l$ (the probability that logistic activities will not degrade $R_i$)

E. $K_u$ (the probability that the user will not degrade $R_i$)

F. The product of all of the above and $R_i$
Chapter 3

Exponential Distribution and Reliability Models


An expression for the inherent reliability of a product was
given in chapter 2 as (ref. 3-1)

$$R_i = P_c P_t P_w$$

where

$P_c$   probability that catastrophic part failures will not occur
$P_t$   probability that tolerance failures will not occur
$P_w$   probability that wearout failures will not occur

In chapter 3, we discuss the term $P_c$ and develop and explain
its mathematical representation in detail. We then use the 
probability theorems to establish methods of writing and solv- 
ing equations for product reliability in terms of series and 
redundant elements. 


Exponential Distribution

To understand what is meant by exponential distribution,
first examine a statistical function called the Poisson distribution,
which is expressed as (ref. 3-2)

$$P(x,t) = \frac{(\lambda t)^x e^{-\lambda t}}{x!}$$

where

$x$         observed number of failures
$t$         operating time
$\lambda$   average failure rate

This distribution states that if an observed average failure rate
$\lambda$ is known for a device, it is possible to calculate the probability
$P(x,t)$ of observing $x = 0, 1, 2, 3, \ldots$ failures when the
device is operated for any period of time $t$.

To illustrate, consider a computer that has been observed to
make 10 arithmetic errors (or catastrophic failures) for every
hour of operation. Suppose that we want to know the probability
of observing 0, 1, and 2 failures during a 0.01-hr program.
From the data given,

$x$ (observed failures) = 0, 1, and 2
$t$ (operating time) = 0.01 hr
$\lambda$ (failure rate) = 10 failures/hr

The probability of observing no failures $P(0, 0.01)$ is then

$$P(0, 0.01) = \frac{(10 \times 0.01)^0 e^{-(10 \times 0.01)}}{0!} = \frac{1 \times e^{-0.1}}{1} = e^{-0.1} = 0.905$$

The probability of observing one failure $P(1, 0.01)$ is

$$P(1, 0.01) = \frac{(10 \times 0.01)^1 e^{-(10 \times 0.01)}}{1!} = \frac{(0.1) e^{-0.1}}{1} = 0.1 \times 0.905 = 0.091$$

The probability of observing two failures $P(2, 0.01)$ is

$$P(2, 0.01) = \frac{(10 \times 0.01)^2 e^{-(10 \times 0.01)}}{2!} = \frac{(0.1)^2 e^{-0.1}}{2 \times 1} = \frac{0.01 \times 0.905}{2} = \frac{0.00905}{2} = 0.0045$$

Remember that the definition of $P_c$ is the probability that
no catastrophic failures will occur. So, for the computer,
$P_c = P(0, 0.01) = 0.905$. In other words, there is a 90.5-percent
chance that no arithmetic errors will occur during the 0.01-hr
program. This is the reliability of the computer for that particular
program.

Again the Poisson distribution for $x = 0$ (i.e., no observed
failures) is

$$P(0,t) = \frac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t}$$

The term $e^{-\lambda t}$ is called the exponential distribution and is the
simplest form of $P_c$. Consequently, for a device that has an
average failure rate $\lambda$, the probability of observing no failures
for a period of time $t$ is (ref. 3-3)

$$P_c = e^{-\lambda t}$$

The expression for inherent reliability now takes the form

$$R_i = e^{-\lambda t} P_t P_w$$

or in the more general expression for total product reliability,

$$R = e^{-\lambda t} P_t P_w (K_q K_m K_r K_l K_u)$$

At this point it is probably a good idea to digress for a moment
to explain why these expressions for reliability may differ from
those used elsewhere. During the conceptual and early research
and development phases of a program, it is common practice
(and sometimes necessary because of a lack of information) to
assume that $P_t = 1$ (the design is perfect), that $P_w = 1$ (no
wearout failures will occur), and that the K-factors all equal 1
(there will be no degradation of inherent reliability). These
assumptions reduce the inherent reliability and product
reliability expressions to

$$R_i = R = e^{-\lambda t}$$

Frequently, these assumptions are not realistic and the resultant
reliability predictions are usually high. They may bear little
resemblance to the reliability finally observed when the product
is tested. Later in this manual, we will let

$$P_c = R = e^{-\lambda t}$$

to keep the notation simple.

On the other hand, it is also common to use $e^{-\lambda t}$ to represent
the observed product reliability. In this case the observed
average failure rate $\lambda$ represents the combination of all types of
failures including catastrophic, tolerance, and wearout. If the
total product failure rate is $\lambda'$, then

$$R = e^{-\lambda' t} = e^{-\lambda t} P_t P_w (K_q K_m K_r K_l K_u)$$


Failure Rate Definition

The failure rate $\lambda$ as used in the exponential distribution $e^{-\lambda t}$
represents random catastrophic part failures that occur in so 
short a time that they cannot be prevented by scheduled main- 
tenance (ref. 3-4). Random means that the failures occur 
randomly in time (not necessarily from random causes as many 
people interpret random failure) and randomly from part to 
part. For example, suppose a contractor uses 1 million inte- 
grated circuits in a computer. Over a period of time he may 
observe an average of one circuit failure every 100 operating 
hrs. Even though he knows the failure rate, he cannot say which 
one of the million circuits will fail. All he knows is that on the 
average, one will fail every 100 hrs. In fact, if a failed circuit is 
replaced with a new one, the new one, theoretically, has the
same probability of failure as any other circuit in the computer. 
In addition, if the contractor performs a failure analysis on each 
of the failed circuits, he may find that every failure is caused by 
the same mechanism, such as poorly welded joints. Unless he 
takes some appropriate corrective action, he will continue to 
observe the same random failures even though he knows the 
failure cause. 

A catastrophic failure is an electrical open or short, a me- 
chanical or structural defect, or an extreme deviation from an 
initial setting or tolerance (a 5-percent-tolerance resistor that 
deviated beyond its end-of-life tolerance, say to 20 percent, 
would be considered to have failed catastrophically). 

The latter portion of the failure rate definition refers to the 
circumstance under which a failure is revealed. If a potential 
operating failure is corrected by a maintenance function, such 
as scheduled preventive maintenance where an out-of- 
tolerance part could be replaced, that replacement cannot be 
represented by $\lambda$ because it did not cause an operating or
unscheduled failure. Here we see one of the many variables that 
affect the operating failure rate of a product: the maintenance 
philosophy. 
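
The Poisson calculation from the computer example earlier in this
chapter (10 failures/hr over a 0.01-hr program) can be checked with a
few lines of code. This is a minimal sketch, not part of the original
manual: the function names poisson_probability and
exponential_reliability are illustrative assumptions, and a constant
failure rate is assumed throughout.

```python
from math import exp, factorial


def poisson_probability(x, failure_rate, time):
    """P(x, t): probability of observing exactly x failures in time t."""
    expected = failure_rate * time  # lambda * t
    return expected ** x * exp(-expected) / factorial(x)


def exponential_reliability(failure_rate, time):
    """P_c = exp(-lambda * t): probability of no failures in time t."""
    return poisson_probability(0, failure_rate, time)


# Computer example: lambda = 10 failures/hr, t = 0.01 hr.
for x in range(3):
    print(x, round(poisson_probability(x, 10.0, 0.01), 4))
# 0 0.9048
# 1 0.0905
# 2 0.0045
```

These values agree with the worked figures to within rounding. Setting
x = 0 reduces the Poisson expression to the exponential form, which is
why exponential_reliability simply delegates to poisson_probability.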
TABLE 3-1. COMMON FAILURE RATE DIMENSIONS

Failures/hr,     Failures/        Failures/
percent          $10^3$ hr        $10^6$ hr

10.0             100.0            100 000.0
1.0              10.0             10 000.0
.1               1.0              1 000.0
.01              .1               100.0
.001             .01              10.0
.0001            .001             1.0
.00001           .0001            .1
.000001          .00001           .01
.0000001         .000001          .001


Failure Rate Dimensions 

Failure rate has the dimension of failure per unit of time, 
where the time is usually expressed in 10* hours or cycles. 
Some government documents express $\lambda$ in percent failures per
10 5 hours. Table 3-1 shows the most common usage. Gener- 
ally, the form that permits calculations using whole numbers 
rather than decimal fractions is chosen. 


“Bathtub” Curve 

In the Poisson distribution, $\lambda$ was referred to as an average
failure rate, indicating that $\lambda$ may be a function of time, $\lambda(t)$.

Figure 3-1 shows three general curves representing $\lambda(t)$
possibilities. Curve A shows that as operating time increases, 
the failure rate also increases. This type of failure rate is found 
where wearout or age is a dominant failure mode stress (e.g., 
slipped clutches or tires). Curve B shows that as operating time 
increases, the failure rate decreases. This type of failure rate has 
been observed in some electronic parts, especially semiconduc- 
tors. Curve C shows that as operating time increases, the failure 
rate remains constant. This type of failure rate has been observed 
in many complex systems and subsystems. In a complex system 
(i.e., one with a large number of parts), parts having decreasing
failure rates reduce the effect of those having increasing failure
rates. The net result is an observed near-constant failure rate for
the system. Therefore, part failure rates are usually given as a
constant although in reality they may not be. This manual deals
only with constant part failure rates because they are related to
system operation. Even if the failure rates might be changing
over a period of time, the constant-failure-rate approximation
is used.

Figure 3-1. Failure rate curves.

If the failure rate for a typical system or complex subsystem 
is plotted against operating life, a curve such as that shown in 
figure 3-2 results. The curve is commonly referred to as a
“bathtub” curve. The time $t_0$ represents the time at which the
system is first put together. The interval from $t_0$ to $t_1$ represents
a period during which assembly errors, defective parts, and
compatibility problems are found and corrected. As shown, the
system failure rate decreases during this debugging, or burn-in,
interval as these gross errors are eliminated. The interval from
$t_1$ to $t_2$ represents the useful operating life of the equipment and
is generally considered to have a constant failure rate. During
this time, the expression $P_c = e^{-\lambda t}$ is used. Therefore, when
using $e^{-\lambda t}$, we assume that the system has been properly
debugged. In practice, this assumption may not be true, but we
may still obtain an adequate picture of the expected operating
reliability by accepting the assumption. The interval from $t_2$ to
$t_3$ represents the wearout period during which age and deterioration
cause the failure rate to increase and render the system
inoperative or extremely inefficient and costly to maintain. 

The following analogy should help to summarize the con- 
cepts of failure and failure rate. A company picnic is planned to 
be held on the edge of a high cliff. Because families will be 
invited, there will be various types of people involved: large, 
small, young, and old, each type with its own personality and 
problems. Picnic officials are worried about someone’s falling 
over the cliff. The question is, What can be done about it? Four 
possible solutions are presented: 

(1) Move the picnic farther back from the cliff. The farther
back, the less the chance someone will fall over. 

(2) Shorten the picnic time. The shorter the picnic, the less 
time someone has to walk to the cliff. 



Figure 3-2. — Failure rate versus operating time. (Horizontal axis: time.)

(3) Look over the cliff to see if anyone has fallen. A good idea
because people would know when to call the ambulance. 
Unfortunately, looking over the cliff does not keep others from 
falling. It is possible, however, that going to the bottom of the 
cliff to see who has fallen over might reveal that every 
15 minutes one person over the age of 99 falls over the cliff. 
Knowing this, all persons over 99 could be sent home and the 
picnic saved from further tragedy. 

(4) Build a high fence to separate the cliff edge from the 
picnickers. Obviously, this is the best solution because it is 
doubtful that anyone would climb the fence just to get to the 
cliff. 

Now, let us look at this picnic-to-failure rate analogy. Say 
that we are building a system (picnic) made of many parts 
(people) and that there are many types of parts; some large, 
some small, and some new and untried, such as integrated
circuits. Some of these parts, the composition resistors for 
instance, are old and mature. Each part has its own personality 
(the way it was fabricated). Our problem is how to keep these 
parts from failing (falling over the cliff). Again we have four 
possible solutions: 

(1) Reduce the stresses on the parts (move the picnic back 
from the cliff); the lower the stresses, the fewer the failures. 

(2) Reduce the operating time (shorten the picnic); the shorter the
operating time, the less chance a part has to fail.

(3) Establish part failure rates (look over the cliff to see if anyone has
fallen); this only helps if we know what parts (people) are
failing. Once we know this, we can eliminate those parts from
our system.

(4) Eliminate the failure mechanisms of the part (build a 
fence to separate the cliff edge from the picnic). This is the best 
answer, of course, because if we eliminate the cause of part 
failures, we cannot have any system failures. 

Mean Time Between Failures 

For the exponential distribution, the reciprocal of the failure 
rate is the mean time between failures (MTBF) and is the 
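integral of the exponential distribution:

$$\mathrm{MTBF} = \frac{1}{\lambda} = \int_0^\infty e^{-\lambda t}\,dt$$

Therefore, if a device has a failure rate of one failure per 100 hrs,
its MTBF is 100 hrs.
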
If the time dimension is given in cycles, the MTBF becomes 
mean cycles between failures (MCBF), a term also in common 
use. For a nonrepairable device, mean time to failure (MTTF)
is used instead of MTBF. For a repairable device, MTBF is
usually equal to MTTF.

For example, if a device has an MTBF of 200 hrs, this neither 
means that the device will not fail until 200 operating hours 
have accumulated nor that the device will fail automatically at 
200 hrs. MTBF is exactly what it says: a mean or average value,
which can be seen from

$$e^{-\lambda t} = e^{-t/\mathrm{MTBF}}$$

When the operating time t equals the MTBF, the probability 
of no failure is (using exponential tables or a slide rule) 

$$e^{-\mathrm{MTBF}/\mathrm{MTBF}} = e^{-1} = 0.368$$

which means that there is a chance of 1 - 0.368 = 0.632 that the 
device will fail before its MTBF is reached. In other words, if 
a device has an MTBF of 1000 hrs, replacing the device after 
999 hrs of operation will not improve reliability. To show the 
concept of a mean value in another way, consider the following 
empirical definition of MTBF: 

$$\mathrm{MTBF} = \frac{\text{Total test hours}}{\text{Total observed failures}}$$

Note that the time when the failures were observed is not 
indicated. The assumption of a constant failure rate leads to a 
constant time between failures, or MTBF. 
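
As a rough illustration of the empirical definition, the sketch below computes an MTBF from a test record and then evaluates the chance of surviving to that MTBF. The function names and the 2000-hr, 10-failure record are assumptions made up for illustration; they are not data from this manual.

```python
import math


def empirical_mtbf(total_test_hours: float, total_failures: int) -> float:
    """MTBF = total test hours / total observed failures."""
    if total_failures <= 0:
        raise ValueError("at least one observed failure is needed")
    return total_test_hours / total_failures


def prob_no_failure(t: float, mtbf: float) -> float:
    """P_c = exp(-t / MTBF) under the constant-failure-rate assumption."""
    return math.exp(-t / mtbf)


# Hypothetical test record: 2000 device-hours with 10 observed failures.
mtbf = empirical_mtbf(2000.0, 10)
# Probability of no failure when the operating time equals the MTBF: exp(-1).
p_at_mtbf = prob_no_failure(mtbf, mtbf)
print(f"MTBF = {mtbf:.0f} hr, P(no failure by MTBF) = {p_at_mtbf:.3f}")
```

Note that only the totals enter the calculation; the times at which the individual failures occurred are not used.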

Calculations of $P_c$ for Single Devices

If a failure rate for a device is known, the probability of 
observing no failures for any operating period t can be calcu- 
lated. 
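
A minimal sketch of this calculation is given here, together with the first-order approximation $1 - \lambda t$, which is close to the exact value only when $\lambda t$ is small. The function names and the assumed failure rate of one failure per 100 hr are illustrative choices.

```python
import math


def prob_no_failure(failure_rate: float, t: float) -> float:
    """P_c = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate * t)


def prob_no_failure_approx(failure_rate: float, t: float) -> float:
    """First-order approximation 1 - lambda * t, intended for small lambda * t."""
    return 1.0 - failure_rate * t


failure_rate = 1 / 100  # assumed: one failure per 100 hr
for hours in (0.1, 8.0):
    exact = prob_no_failure(failure_rate, hours)
    approx = prob_no_failure_approx(failure_rate, hours)
    print(f"t = {hours} hr: exact = {exact:.4f}, approx = {approx:.4f}")
```

For the short operating time the two values agree to about three decimal places; for the longer one the approximation starts to drift, which is why the exact exponential is preferred once $\lambda t$ grows.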

Example 1: A control computer in a missile has a failure rate
of 1 per $10^2$ hrs. Find $P_c$ for a flight time of 0.1 hr.

Solution 1: The failure rate is $\lambda = 1/10^2$ failures/hr (equivalently, an MTBF of 100 hrs), so

$$P_c = e^{-\lambda t} = e^{-(1/10^2)(0.1)} = e^{-1 \times 10^{-3}} = e^{-0.001} = 0.999$$

Therefore, there is one chance in a thousand that the control
computer will fail. (Note: if $\lambda t$ or $t/\mathrm{MTBF}$ is less than 0.01,
$P_c \approx 1 - \lambda t$, or $1 - t/\mathrm{MTBF}$.) For example,

$$P_c = e^{-0.001} \approx 1 - 0.001 = 0.999$$

If $\lambda t$, or $t/\mathrm{MTBF}$, is greater than 0.01, use exponential tables to
find $P_c$ as shown here:

$$P_c = e^{-0.08} = 0.923$$

Example 2: The same type of problem can be solved if the 
MTBF is known. The MTBF of a tape reader used in ground 
support equipment is 100 hrs. Find P c for a 2-hr operation. 
Solution 2: 

$$P_c = e^{-t/\mathrm{MTBF}} = e^{-2/100} = e^{-0.02} = 0.980$$

If a specific $P_c$ is required for a specified operating time, the
required failure rate, or MTBF, can be calculated. 
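
The inverse calculation can be sketched as follows: solving $P_c = e^{-\lambda t}$ for $\lambda$ gives $\lambda = -\ln(P_c)/t$. The function name is hypothetical, and the inputs (0.999 over 10 000 cycles) are chosen to match the relay example in this section.

```python
import math


def required_failure_rate(target_reliability: float, t: float) -> float:
    """Failure rate that yields the target reliability over operating time t."""
    if not 0.0 < target_reliability < 1.0:
        raise ValueError("target reliability must lie strictly between 0 and 1")
    if t <= 0:
        raise ValueError("operating time must be positive")
    return -math.log(target_reliability) / t


# Assumed inputs: 0.999 probability of no failure over 10 000 cycles.
failure_rate = required_failure_rate(0.999, 10_000)  # failures per cycle
mean_cycles_between_failures = 1.0 / failure_rate
print(f"lambda = {failure_rate:.3e} per cycle")
print(f"MCBF   = {mean_cycles_between_failures:.3e} cycles")
```

The exact logarithm gives a rate very slightly above 1 failure per $10^7$ cycles because $e^{-0.001}$ is not exactly 0.999; the table-based hand calculation rounds this difference away.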

Example 3: A relay is required to have a 0.999 probability of 
not failing for 10 000 cycles. Find the required failure rate and 
MCBF. 

Solution 3: 

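$$R = e^{-\lambda t}$$

$$0.999 = e^{-0.001} = e^{-\lambda(10^4\ \text{cycles})}$$

Equating exponents gives

$$\lambda(10^4\ \text{cycles}) = 0.001$$

$$\lambda = \frac{0.001}{10^4} = \frac{1\ \text{failure}}{10^7\ \text{cycles}}$$

The required MCBF is therefore

$$\mathrm{MCBF} = \frac{1}{\lambda} = 10^7\ \text{cycles}$$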


Reliability Models 

In the following sections we replace $P_c = e^{-\lambda t}$, the reliability
of a part, with an R to keep the notation simple. 

Calculation of Reliability for Series-Connected Devices 

In reliability, devices are considered to be in series if each 
device is required to operate without failure to obtain system 
success (ref. 3-5). A system composed of two parts is represented 
in a reliability diagram, or model, as shown in figure 3-3. If the 
reliability R for each part is known (probability theorem 2, 
ch. 2), the probability that the system will not fail is

$$R_s = R_1 R_2$$

Figure 3-3. — Series model.

(We assume that the part reliabilities are independent; i.e., the 
success or failure of one part will not affect the success or 
failure of another part.) If there are n parts in the system with 
each one required for system success, the total system reliabil- 
ity is given by 

$$R_s = R_1 R_2 R_3 \cdots R_n = \prod_{j=1}^{n} R_j$$

where

$R_s$   probability that system will not fail
$R_j$   reliability of $j$th part
$n$   total number of parts

The expression

$$R_s = \prod_{j=1}^{n} R_j$$

is often called the product rule. 
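
A minimal sketch of the product rule, assuming independent parts as stated above. The function name is hypothetical, and the 100 parts at 0.99 each are illustrative inputs.

```python
import math
from typing import Iterable


def series_reliability(part_reliabilities: Iterable[float]) -> float:
    """Product rule: the system succeeds only if every series part succeeds."""
    return math.prod(part_reliabilities)


# Assumed system: 100 independent parts, each with reliability 0.99.
system_reliability = series_reliability([0.99] * 100)
print(f"R_s = {system_reliability:.3f}")
```

Even with highly reliable parts, a long series chain drives the system reliability down quickly, which is the practical message of the product rule.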

Example 4: A system has 100 parts, each one required for
system success. Find the system reliability $R_s$ if each part has
$R = 0.99$.

Solution 4:

$$R_s = \prod_{j=1}^{n} R_j = \prod_{j=1}^{100} R_j = R_1 R_2 R_3 \cdots R_{100}$$

$$= (0.99)(0.99)(0.99) \cdots (0.99) = (0.99)^{100} \approx 0.366$$



Therefore, the probability that the system will succeed is about 
37 percent. 

Example 5: For a typical missile that has 7000 active parts
and a reliability requirement of 0.90, each part would have to
have a reliability $R_p$ of 0.999985, which is calculated using
table A-1:

$$(R_p)^{7000} = 0.90 = e^{-0.105}$$

Solution 5: Therefore,

$$R_p = (e^{-0.105})^{1/7000} = e^{-1.5 \times 10^{-5}} = e^{-0.000015} \approx 1 - 0.000015 = 0.999985$$
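
The same allocation can be sketched in code by inverting the product rule for identical parts, $R_p = R_s^{1/n}$. The function name is hypothetical; the inputs are the 7000 parts and the 0.90 requirement of the missile example.

```python
def required_part_reliability(system_target: float, n_parts: int) -> float:
    """Per-part reliability R_p such that R_p ** n_parts equals the system target."""
    if not 0.0 < system_target <= 1.0:
        raise ValueError("system target must lie in (0, 1]")
    if n_parts < 1:
        raise ValueError("at least one part is required")
    return system_target ** (1.0 / n_parts)


part_reliability = required_part_reliability(0.90, 7000)
print(f"R_p = {part_reliability:.6f}")
# Sanity check: multiplying 7000 such parts should recover the requirement.
print(f"R_p ** 7000 = {part_reliability ** 7000:.4f}")
```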


The product rule can also be expressed as

$$R_s = \prod_{j=1}^{n} R_j = R_1 R_2 R_3 \cdots R_n$$

$$= e^{-\lambda_1 t_1} e^{-\lambda_2 t_2} e^{-\lambda_3 t_3} \cdots e^{-\lambda_n t_n} = e^{-(\lambda_1 t_1 + \lambda_2 t_2 + \cdots + \lambda_n t_n)}$$

$$= \exp\left(-\sum_{j=1}^{n} \lambda_j t_j\right)$$

where

$\lambda_j$   failure rate of $j$th part
$t_j$   operating time of $j$th part

Therefore, if for each series-connected part in a system the
failure rate and operating time are known, the system reliability
can be calculated by finding $-\sum_{j=1}^{n} \lambda_j t_j$ and raising $e$ to that
power.

Example 6: Find the system reliability from the model shown
in figure 3-4.

Solution 6:

Step 1

$$\sum_{j=1}^{3} \lambda_j t_j = \lambda_1 t_1 + \lambda_2 t_2 + \lambda_3 t_3$$

$$= \frac{10}{10^3}(10) + \frac{20}{10^3}(4) + \frac{100}{10^3}(2)$$

$$= \frac{100}{10^3} + \frac{80}{10^3} + \frac{200}{10^3} = \frac{380}{10^3}$$

Step 2

$$R_s = \exp\left(-\sum_{j=1}^{3} \lambda_j t_j\right) = e^{-380/10^3} = e^{-0.38} = 0.684$$

If the $t_j$'s are equal (i.e., each part of the device operates for the
same length of time), the product rule can further be reduced to

$$R_s = \exp\left(-t \sum_{j=1}^{n} \lambda_j\right)$$

where $t$ is the common operating time.
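
Both forms of the product rule can be sketched as below. The function names are hypothetical; the failure rates (per hour) and operating times (in hours) are the values used in the two worked examples of this section, with a common operating time of 10 hr for the equal-time case.

```python
import math
from typing import Iterable, Tuple


def series_reliability_from_rates(parts: Iterable[Tuple[float, float]]) -> float:
    """R_s = exp(-sum(lambda_j * t_j)) for (failure_rate, operating_time) pairs."""
    exponent = sum(failure_rate * t for failure_rate, t in parts)
    return math.exp(-exponent)


def series_reliability_common_time(failure_rates: Iterable[float], t: float) -> float:
    """R_s = exp(-t * sum(lambda_j)) when every part operates for the same time t."""
    return math.exp(-t * sum(failure_rates))


# Parts with different operating times: (failures/hr, hr).
mixed_parts = [(10 / 1e3, 10), (20 / 1e3, 4), (100 / 1e3, 2)]
r_mixed = series_reliability_from_rates(mixed_parts)

# Parts sharing a common operating time of 10 hr.
r_common = series_reliability_common_time([7 / 1e3, 5 / 1e3, 6 / 1e3], 10)

print(f"different operating times: R_s = {r_mixed:.3f}")
print(f"common operating time:     R_s = {r_common:.3f}")
```

Summing the exponents first and exponentiating once avoids multiplying many numbers close to 1, which mirrors the hand method of using a single table lookup.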

Example 7: Find the reliability of the system shown in fig- 
ure 3-5. 

Solution 7: 

Step 1

$$\sum_{j=1}^{3} \lambda_j = \lambda_1 + \lambda_2 + \lambda_3 = \frac{7}{10^3} + \frac{5}{10^3} + \frac{6}{10^3} = \frac{18}{10^3}$$

Step 2

$$R_s = \exp\left(-t \sum_{j=1}^{3} \lambda_j\right) = e^{-(18/10^3)(10)} = e^{-180/10^3} = e^{-0.18} = 0.835$$

Figure 3-4. — Series model using failure rates and operating times.

Figure 3-5. — Series model with operating times equal.


Calculation of Reliability for Parallel-Connected Devices 
(Redundancy) 

In reliability, devices are considered to be in parallel if one or 
more of them can fail without causing system failure but at least 
one of them must succeed for the system to succeed. First we 
consider simple redundancy. 

Simple redundancy . — If n devices are in parallel so that only 
one of them must succeed for the system to succeed, the devices 
are said to be in simple redundancy. The model of a two-part 
redundancy system presented in figure 3-6 illustrates this 
concept. In other words, if part 1 fails, the system can still 
succeed if part 2 does not fail, and vice versa. However, if both 
parts fail, the system fails. 

From probability theorem 3 in chapter 2, we know that the 
possible combinations of success R and failure Q of two devices 
are given by

$$R_1 R_2 + R_1 Q_2 + Q_1 R_2 + Q_1 Q_2$$

where

$R_1 R_2$   both parts succeed
$R_1 Q_2$   part 1 succeeds and part 2 fails
$Q_1 R_2$   part 1 fails and part 2 succeeds
$Q_1 Q_2$   both parts fail

We also know that the sum of the probabilities of these events equals
unity since the events are mutually exclusive (i.e., if one event occurs,
the others cannot occur) and together cover every possible outcome.
Therefore,

$$R_1 R_2 + R_1 Q_2 + Q_1 R_2 + Q_1 Q_2 = 1$$

Because at least one of the parts or devices must succeed in
simple redundancy, the probability of this happening is given
by

$$R_1 R_2 + R_1 Q_2 + Q_1 R_2 = 1 - Q_1 Q_2$$

Figure 3-7. — Space capsule guidance model.


In simple terms, if the only way the redundant system can fail 
is by all redundant parts failing, the probability of success must 
be equal to 1 minus the probability that all redundant parts will 
fail (i.e., R = 1 - Q) from probability theorem 1 in chapter 2.
This reasoning can be extended to n redundant parts if at least 
one of the n parts must succeed for the system to succeed. 
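
The reasoning above can be checked with a short sketch that computes $1 - \prod Q_j$ and compares it with a brute-force sum over every success/failure combination. The function names are hypothetical, and the three reliabilities (0.9, 0.8, 0.7) are illustrative inputs for independent parts.

```python
import math
from itertools import product
from typing import Sequence


def simple_redundancy(reliabilities: Sequence[float]) -> float:
    """R = 1 - product of the part unreliabilities Q_j = 1 - R_j."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def simple_redundancy_by_enumeration(reliabilities: Sequence[float]) -> float:
    """Sum the probability of every outcome in which at least one part succeeds."""
    total = 0.0
    for outcome in product((True, False), repeat=len(reliabilities)):
        probability = math.prod(
            r if succeeded else 1.0 - r
            for r, succeeded in zip(reliabilities, outcome)
        )
        if any(outcome):
            total += probability
    return total


paths = [0.9, 0.8, 0.7]  # assumed independent redundant paths
print(f"closed form: {simple_redundancy(paths):.3f}")
print(f"enumeration: {simple_redundancy_by_enumeration(paths):.3f}")
```

The enumeration grows as $2^n$ and is shown only to confirm the shortcut; the closed form is what one would use in practice.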

Example 8: Suppose that a space capsule can be guided three
ways: (1) automatically with $R_1 = 0.9$, (2) semiautomatically
with $R_2 = 0.8$, (3) manually with $R_3 = 0.7$. The diagram of
successful guiding, assuming that the three ways are independent
of each other, is shown in figure 3-7. From probability
theorem 3 in chapter 2, the possible events are given by

$$R_1 R_2 R_3 + R_1 R_2 Q_3 + R_1 Q_2 R_3 + Q_1 R_2 R_3 + R_1 Q_2 Q_3 + Q_1 Q_2 R_3 + Q_1 R_2 Q_3 + Q_1 Q_2 Q_3$$

Because the sum of these probabilities is equal to unity and at
least one of the control systems must operate successfully, the
probability that guidance will be successful, $R_{\text{guidance}}$, is

$$R_{\text{guidance}} = R_1 R_2 R_3 + R_1 R_2 Q_3 + R_1 Q_2 R_3 + Q_1 R_2 R_3 + R_1 Q_2 Q_3 + Q_1 Q_2 R_3 + Q_1 R_2 Q_3$$

$$= 1 - Q_1 Q_2 Q_3 = 1 - [(1 - R_1)(1 - R_2)(1 - R_3)]$$

$$= 1 - [(1 - 0.9)(1 - 0.8)(1 - 0.7)]$$

$$= 1 - [(0.1)(0.2)(0.3)]$$

$$= 1 - (0.006) = 0.994$$

In general, then, for simple redundancy

$$R_{\text{simple redundant}} = 1 - \prod_{j=1}^{n} Q_j = 1 - (Q_1 Q_2 Q_3 \cdots Q_n)$$

where

$\prod_{j=1}^{n} Q_j$   total probability of failure
$Q_j$   total probability of failure of $j$th redundant part
$n$   total number of redundant parts

Figure 3-6. — Simple redundancy model.

Example 9: Find the reliability of the redundant system
shown in figure 3-8.

Figure 3-8. — Simple redundancy model using failure rates and operating times.

Solution 9:

Step 1 — Solve for the reliability of parts 1 and 2:

$$R_1 = e^{-\lambda_1 t_1} = e^{-[(120/10^6) \times 10^3]} = e^{-0.120} = 0.887$$

$$R_2 = e^{-\lambda_2 t_2} = e^{-[(340/10^6) \times 10^3]} = e^{-0.340} = 0.712$$

Step 2 — Solve for the unreliability of each part:

$$Q_1 = 1 - R_1 = 0.113$$

$$Q_2 = 1 - R_2 = 0.288$$

Solve for the reliability of the redundant system:

$$R_{\text{simple redundant}} = 1 - Q_1 Q_2 = 1 - (0.113)(0.288) = 1 - 0.033 = 0.967$$

There is a 96.7-percent chance, therefore, that the two parts will
not both fail during the 1000-hr operating time.

Compound redundancy. — Compound redundancy exists
when more than one of n redundant parts must succeed for the
system to succeed. This can be shown in a model of a three-element
redundant system in which at least two of the elements
must succeed (fig. 3-9).

Figure 3-9. — Compound redundancy model.

From probability theorem 3 in chapter 2, the possible events
are

$$R_1 R_2 R_3 + R_1 R_2 Q_3 + R_1 Q_2 R_3 + Q_1 R_2 R_3 + R_1 Q_2 Q_3 + Q_1 Q_2 R_3 + Q_1 R_2 Q_3 + Q_1 Q_2 Q_3$$

To simplify the notation, let $R_1 = R_2 = R_3 = R$ and $Q_1 = Q_2 = Q_3 = Q$. This
reduces the expression to

$$R^3 + R^2 Q + R^2 Q + R^2 Q + R Q^2 + R Q^2 + R Q^2 + Q^3$$

or

$$R^3 + 3R^2 Q + 3R Q^2 + Q^3$$

Because the sum of these probabilities equals unity and at least
two of the three parts must succeed, the probability for success
is given by

$$R_s = R^3 + 3R^2 Q = 1 - (3RQ^2 + Q^3)$$

where $3RQ^2$ represents one part succeeding and two parts
failing and $Q^3$ represents all three parts failing.
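
The expansion used above for two out of three parts extends to any arrangement in which at least k of n identical, independent parts must succeed: keep the terms of $(R + Q)^n$ with k or more successes. The sketch below is one way to write that; the function name and the trial values are assumptions for illustration.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n identical, independent parts succeed."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    q = 1.0 - r
    # Each term is one group of the binomial expansion of (R + Q) ** n.
    return sum(comb(n, i) * r**i * q ** (n - i) for i in range(k, n + 1))


r = 0.9  # assumed part reliability
two_of_three = k_out_of_n_reliability(2, 3, r)
two_of_four = k_out_of_n_reliability(2, 4, r)
print(f"at least 2 of 3: {two_of_three:.4f}")
print(f"at least 2 of 4: {two_of_four:.4f}")
```

Setting k = 1 recovers simple redundancy, and k = n recovers the series product rule for identical parts.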

Example 10: Assume that there are four identical power
supplies in a fire control center and that at least two of them
must continue operating for the system to be successful. Let
each supply have the same reliability, $R = 0.9$ (which could
represent $e^{-\lambda t}$, the inherent reliability $R_i$, or the observed
reliability). Find the probability of system success.

Solution 10: The possible events are given by the expansion

$$(R + Q)^4 = R^4 + 4R^3 Q + 6R^2 Q^2 + 4RQ^3 + Q^4$$

The sum of the probabilities of these events equals unity;
therefore, the expression for at least two out of four succeeding is

$$R_s = R^4 + 4R^3 Q + 6R^2 Q^2 = 1 - (4RQ^3 + Q^4)$$

Substituting $R = 0.9$ and $Q = 1 - 0.9$ gives

$$R_s = 1 - (4RQ^3 + Q^4) = 1 - [4(0.9)(0.1)^3 + (0.1)^4]$$

$$= 1 - [(3.6)(0.001) + 0.0001] = 1 - (0.0036 + 0.0001)$$

$$= 1 - 0.0037 = 0.996$$

Figure 3-10. — Model of system with series and redundant elements.


Calculation of Reliability for Complete System 

To find the reliability for a complete system, begin by 
developing a model for the system, write the equation for the 
probability of success from the model, and then use the failure 
rates and operating times of the system elements to calculate the 
reliability of the system (refs. 3-6 to 3-8). 

Example 11: Consider the system model with series and 
redundant elements shown in figure 3-10. 

Solution 11: The equation can be written directly as

$$R_s = R_1 R_2 R_3 (1 - Q_4 Q_5 Q_6)$$

where $R_1 R_2 R_3$ represents the probability of success of the series
parts and $(1 - Q_4 Q_5 Q_6)$ represents the probability of success of
the three parts in simple redundancy. If we know that

$R_1 = 0.99 = e^{-0.01}$      $R_4 = 0.85$
$R_2 = 0.999 = e^{-0.001}$    $R_5 = 0.89$
$R_3 = 0.95 = e^{-0.05}$      $R_6 = 0.78$

where R may represent $e^{-\lambda t}$, inherent reliability $R_i$, or observed
product reliability depending on the stage of product development,
then the reliability of the system is

$$R_s = e^{-0.01} e^{-0.001} e^{-0.05} [1 - (1 - 0.85)(1 - 0.89)(1 - 0.78)]$$

$$= e^{-0.061} [1 - (0.15)(0.11)(0.22)] = e^{-0.061} (1 - 0.00363)$$

$$= e^{-0.061} e^{-0.0036} = e^{-0.065} \approx 0.937$$

However, this does not mean that there will be no equipment 
failures. The system will still succeed even though one or two 
of the redundant paths have failed. 
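
A minimal sketch of this kind of mixed calculation: multiply the series parts, compute the redundant block as $1 - \prod Q_j$, and multiply the two results. The variable names are hypothetical; the six reliabilities are the ones listed in the example above.

```python
import math

series_parts = [0.99, 0.999, 0.95]     # parts 1 to 3, all required
redundant_parts = [0.85, 0.89, 0.78]   # parts 4 to 6, any one suffices

# Series block: product rule.
r_series = math.prod(series_parts)
# Redundant block: fails only if every redundant part fails.
r_redundant = 1.0 - math.prod(1.0 - r for r in redundant_parts)

r_system = r_series * r_redundant
print(f"series block    = {r_series:.4f}")
print(f"redundant block = {r_redundant:.5f}")
print(f"system          = {r_system:.3f}")
```

Multiplying the listed reliabilities directly gives about 0.936. A hand solution that first rounds each term to an exponent can differ slightly in the last digit.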

Example 12: Write the equation for the system shown in 
figure 3-11. 

Solution 12: The equation can be written directly as

$$R_s = R_1 R_2 \left[1 - (R_3 Q_4 Q_5 + Q_3 R_4 Q_5 + Q_3 Q_4 R_5 + Q_3 Q_4 Q_5)\right](1 - Q_6 Q_7)$$

where $R_1 R_2$ is the probability that the two parts in series will not
fail, $1 - (R_3 Q_4 Q_5 + \cdots + Q_3 Q_4 Q_5)$ is the probability that at least two
of the three compound redundant parts will not fail, and
$(1 - Q_6 Q_7)$ is the probability that the two simple redundant
parts will not both fail. If data giving the reliabilities of each part are
available, insert this information in the system success equation
to find the system reliability.



Figure 3-11. — System reliability model using series, simple redundancy, and compound redundancy elements.

Figure 3-12. — Model with series elements in redundant paths.


Example 13: Write the equation for the system shown in 
figure 3-12. 

Solution 13: The equation can be written directly as

$$R_s = R_1 R_6 R_7 \{1 - [Q_2 Q_3 (1 - R_4 R_5)]\}$$

where $R_1 R_6 R_7$ is the reliability of the series parts, $(1 - R_4 R_5)$ is
the probability that part 4 or part 5 will fail in the bottom redundant
path, and $\{1 - [Q_2 Q_3 (1 - R_4 R_5)]\}$ is the reliability of the three
paths in simple redundancy.
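
A sketch of how such an equation might be evaluated is shown below. It assumes the arrangement read from figure 3-12: parts 1, 6, and 7 in series with a block of three redundant paths, where two paths hold a single part each (parts 2 and 3) and the bottom path holds parts 4 and 5 in series. The function name and the trial reliability of 0.9 for every part are hypothetical.

```python
def system_reliability(r1: float, r2: float, r3: float, r4: float,
                       r5: float, r6: float, r7: float) -> float:
    """Series parts 1, 6, 7 times a three-path redundant block (assumed layout)."""
    q_path_top = 1.0 - r2          # single-part path fails if part 2 fails
    q_path_middle = 1.0 - r3       # single-part path fails if part 3 fails
    q_path_bottom = 1.0 - r4 * r5  # series path fails if part 4 or part 5 fails
    r_redundant_block = 1.0 - q_path_top * q_path_middle * q_path_bottom
    return r1 * r6 * r7 * r_redundant_block


# Hypothetical case: every part has reliability 0.9.
r_system = system_reliability(*([0.9] * 7))
print(f"R_s = {r_system:.4f}")
```

Treating each redundant path as a single equivalent part first, and only then applying $1 - \prod Q$, keeps the equation readable when paths contain series elements.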


Concluding Remarks 

Chapter 3 has presented several important concepts that you 
should have clearly in mind: 

(1) The exponential distribution $e^{-\lambda t}$ represents the probability
that no catastrophic part failures will occur in a product.

(2) The failure rate $\lambda$ as used in $e^{-\lambda t}$ is a constant and
represents the rate at which random catastrophic failures occur.

(3) Although the cause of failure is known, random failures 
may still occur. 

(4) The mean time between failures (MTBF) is the reciprocal 
of the failure rate. 

(5) In reliability, devices are in series if each one is required 
to operate successfully for the system to be successful. Devices 
are parallel or redundant if one or more can fail without causing 
system failure but at least one of the devices must succeed for 
the system to succeed. 

In addition, you should be able to calculate the following: 

(1) The reliability of a device, given failure rate and operating
time 

(2) The reliability of devices connected in series from the 
product rule

$$R_s = \prod_{j=1}^{n} R_j$$

(3) The reliability of devices connected in simple redundancy 
from 


$$R_{\text{simple redundant}} = 1 - \prod_{j=1}^{n} Q_j$$


(4) The reliability of n devices connected in compound 
redundancy by expanding $(R + Q)^n$ and collecting the appropriate
terms.

And finally, you should be able to combine the four methods 
described above to calculate the reliability of a total system. 

In 1985, alternative methodologies were introduced in the 
form of computer reliability analysis programs. One such 
underlying model uses a Weibull failure rate during the burn-in,
or “infant mortality,” period and a constant failure rate
during the steady-state period for electronic devices. Initial 
results indicate that given a 15- to 40-yr system life, the infant 
mortality period is assumed to last for the first year. Of course, 
the higher the stress of the environment, the shorter the period 
of infant mortality. The point is that there are many ways to 
perform reliability studies, and different methodologies could 
be equally appropriate or inappropriate. Appendix C describes 
five distribution functions that can be used for reliability 
analysis. Table C-1 shows the time-to-failure fit for various
systems. The basic criteria relate to the distribution of failures 
with time. 


References 

3-1. Failure Distribution Analyses Studies, Vols. I, II, and III. Computer
Applications Inc., New York, Aug. 1964. (Avail. NTIS; AD-631525,
AD-631526, AD-631527.)
3-2. Hoel, Paul G.: Elementary Statistics. John Wiley & Sons, Inc., 1960.
3-3. Calabro, S.: Reliability Principles and Practices. McGraw-Hill, 1962.
3-4. Reliability Prediction of Electronic Equipment. MIL-HDBK-217E,
Jan. 1990.
3-5. Electronic Reliability Design Handbook. MIL-HDBK-338, Vols. I
and II, Oct. 1988.
3-6. Bloomquist, C.; and Graham, W.: Analysis of Spacecraft On-Orbit
Anomalies and Lifetimes. (PRC R-3579, PRC Systems Sciences Co.;
NASA Contract NAS5-27279), NASA CR-170565, 1983.
3-7. Government-Industry Data Exchange Program (GIDEP). Reliability-Maintainability
(R-M) Analyzed Data Summaries. Vol. 7, Oct. 1985.
3-8. Kececioglu, D.: Reliability Engineering Handbook, Vols. 1 and 2.
Prentice-Hall, 1991.




NASA/TP — 2000-207428 





Reliability Training¹

1a. Of 45 launch vehicle flights, 9 were determined to be failures. What is the observed reliability?

A. 0.7 B. 0.8 C. 0.9

1b. What is the observed reliability if the next five flights are successful?

A. 0.72 B. 0.82 C. 0.87

1c. After the five successes of part 1b, how many more successes (without additional failures) are required for
a reliability of R = 0.90?

A. 20 B. 30 C. 40

2. A three-stage launch vehicle has a reliability for each stage of $R_1 = 0.95$, $R_2 = 0.94$, $R_3 = 0.93$.

a. What is the probability of one successful flight?

A. 0.83 B. 0.85 C. 0.87

b. What is the probability of flight failure for part a?

A. 0.00021 B. 0.15 C. 0.17

c. What is the probability of two successful flights?

A. 0.689 B. 0.723 C. 0.757

3. You are taking a trip in your car and have four good tires and a good spare. By expanding $(R + Q)^5$,

a. How many events (good tires or flats) are available?

A. 16 B. 32 C. 64

b. How many combinations provide four or more good tires?

A. 6 B. 7 C. 16

c. If R = 0.99 for each tire and a successful trip means you may have only one flat, what is the probability
that you will have a successful trip?

A. 0.980 B. 0.995 C. 0.9990

4. A launch vehicle system is divided into five major subsystems, three of which have already been built and
tested. The reliability of each is as follows: $R_1 = 0.95$, $R_2 = 0.95$, $R_3 = 0.98$. The reliability of the overall
system must be equal to or greater than 0.85. What will be the minimum acceptable reliability of subsystems
4 and 5 to ensure 85-percent reliability?

A. 0.92 B. 0.95 C. 0.98

¹Answers are given at the end of this manual.

5a. A launch vehicle test program consists of 20 test firings requiring 90-percent reliability. Five tests have
already been completed with one failure. How many additional successes must be recorded to successfully
complete the test program?

A. 13 B. 14 C. 15

5b. Based on the probability (four successes in five flights), what is the probability of achieving successful
completion of the test program?

A. 0.04 B. 0.167 C. 0.576

6. During individual tests of major launch vehicle subsystems, the reliability of each subsystem was found
to be

Subsystem 1 = 0.95
Subsystem 2 = 0.99
Subsystem 3 = 0.89
Subsystem 4 = 0.75

Since all subsystems are required to function properly to achieve success, what increase in reliability of
subsystem 4 would be necessary to bring the overall system reliability to 0.80?

A. 15 percent B. 20 percent C. 25 percent

7. Solve for the following unknown values:

a. $\lambda = 750 \times 10^{-6}$ failures/hr; t = 10 hr; R = ?

A. 0.9925 B. 0.9250 C. 0.9992

b. $\lambda$ = 8.5 percent failures/$10^3$ hr; t = 3000 hr; R = ?

A. 0.9748 B. 0.7986 C. 0.0781

c. MTBF = 250 hr/failure; t = 0.5 hr; R = ?

A. 0.9802 B. 0.9980 C. 0.9998

d. R = 0.999; t = 10 hr; $\lambda$ = ?

A. $1000 \times 10^{-9}$ failures/hr B. $10 \times 10^{-6}$ failures/hr C. 10 percent failures/$10^3$ hr

e. MTBF = ?

A. $10^4$ hr/failure B. $10^5$ hr/failure C. $10^6$ hr/failure

8. The a priori MTBF prediction of a printed circuit board was $12.5 \times 10^6$ hr. Find the number of expected
failures during a $10^8$-hr (accelerated) life test of 10 circuit board samples.

A. 12.5 B. 80 C. 125

9a. Write the reliability equation for the battery activation success diagram shown below:

If     Battery activates command (part 1)
And    Passes umbilical path (part 2)
And    Initiates EBW 1 (part 3) or EBW 2 (part 4)
And    Ignites initiator 1 (part 5) or initiator 2 (part 6)
Then   Battery activates (part 7): Success

A. $R_s = R_1 R_2 (1 - R_3 R_4)(1 - R_5 R_6) R_7$     B. $R_s = R_1 R_2 (1 - Q_3 Q_4)(1 - Q_5 Q_6) R_7$

9b. If R = 0.9 for all series and R = 0.8 for all parallel parts, solve for $R_s$.

A. 0.73 B. 0.26 C. 0.67

10. A launch vehicle subsystem is required to be stored for 10 years (use 9000 hr = 1 year). If the subsystem
reliability goal is 0.975,

a. What $\lambda$ is required with no periodic checkout and repair?

A. $2800 \times 10^{-9}$ B. $28 \times 10^{-9}$ C. $280 \times 10^{-9}$

b. What $\lambda$ is required with checkout and repair every 5 years? (Assume 100-percent checkout.)

A. $5600 \times 10^{-9}$ B. $56 \times 10^{-9}$ C. $560 \times 10^{-9}$

c. What $\lambda$ is required with checkout and repair every year? (Assume 100-percent checkout.)

A. $2800 \times 10^{-9}$ B. $28 \times 10^{-9}$ C. $280 \times 10^{-9}$

Chapter 4

Using Failure-Rate Data 

Now that you have a working knowledge of the exponential
distribution $e^{-\lambda t}$ and have the fundamentals of series and
redundant models firmly in mind, the next task is to relate these 
concepts to your everyday world. To do this, we explore further 
the meaning of failure rates, examine variables that affect part 
failure modes and mechanisms, and then use part failure rate 
data to predict equipment reliability. We introduce a simple 
technique for allocating failure rates to elements of a system. 
The concepts discussed in this chapter are tools the designer can 
use for trading off reliability with other factors such as weight, 
complexity, and cost. These concepts also provide guidelines 
for designing reliability into equipment during the concept 
stage of a program. 

Variables Affecting Failure Rates 

Part failure rates are affected by (1) acceptance criteria, (2) all
environments, (3) application, and (4) storage. To reduce the 
occurrence of part failures, we observe failure modes, learn 
what caused the failure (the failure stress), determine why it 
failed (the failure mechanism), and then take action to eliminate 
the failure. For example, one of the failure modes observed 
during a storage test was an “open” connection in a wet 
tantalum capacitor. The failure mechanism was end seal dete- 
rioration, which allowed the electrolyte to leak. One obvious 
way to avoid this failure mode in a system that must be stored 
for long periods without maintenance is not to use wet tantalum 
capacitors. If this is impossible, the best solution would be to 
redesign the end seals. Further testing would be required to 
isolate the exact failure stress that produces the failure mecha- 
nism. Once isolated, the failure mechanism can often be elimi- 
nated through redesign or additional process controls. 

Operating Life Test 

The tests involved 7575 parts — 3930 resistors, 1545 
capacitors, 915 diodes, 1080 transistors, and 105 transformers. 


One-third of the parts were operated at -25 °F, one-third at 
77 °F, and one-third at 125 °F. The parts, tested in circuits 
(printed circuit boards), were derated no more than 40 percent. 
The ordinate of the curve shows cumulative failures as a 
function of operating time. For example, at about 240 hours, the 
first failure was observed and at about 385 hours, the second. 
Several important observations can be made concerning failure 
rates and failure modes. 

Constant failure rate. — Figure 4-1 shows that the failure
rate for the first 1600 hr is constant at one failure every 145 hr.
This agrees with the constant-$\lambda$ theory. Bear in mind that
constant failure rate is an observation and not a physical law. 
Depending on the equipment, failure rates may decrease or 
increase for a period of time. 

Random nature. — Notice that the failures in this constant-
failure-rate region are random (in occurrence). For example, 
two diodes fail, then three transistors, then a silicon switch, then 
a diode, then a trimpot and a resistor, and so on. 

Repetitive failures. — Figure 4-1 also shows that during the
first 1600 hr, only two of these failures involved the same type 
of device. This is important because in most systems the 
problems that receive the most attention are the repetitive ones. 
It should be apparent in this case that the repetitive failures are 
not the ones that contribute the most to unreliability (failure 
rate); taking corrective action on the repetitive type of failure 
would only improve the observed failure rate by 18 percent. 

Failure modes. — Table 4-1 shows the observed failure
modes (the way the failures were revealed) for the transistor,
diode, and resistor failures given in figure 4-1. In table 4-1(a),
note that the short failure mode for transistors had an occurrence
rate five times that of any other mode. Note also that the
eight transistor failures were distributed about evenly in the 
three environments but that some different failure modes were 
observed in each environment. 

Observe again in table 4-1(b) that the short failure mode for
diodes occurred most frequently. The failures were not distributed
evenly in each environment, but a different failure mode
occurred in each environment.

Figure 4-1. — Observed part failures versus test and storage time.
(Vertical axis: cumulative failures. Horizontal axis: time, t, hr, divided
into test time and storage time. Curve annotations: infant mortality
failure rate, 1 failure/145 hr; intrinsic failure rate, 1 failure/2300 hr.)

Failure labels on the curve, operating test (in the order printed on the figure):

Transistor, short, 77 °F, 2N396
Transistor, open, 125 °F, Mo 90
Transistor, short, -25 °F, 2N498
Transistor, short, -25 °F, Mo 90
Transistor, leakage, -25 °F, 2N1057
Resistor, tolerance change, 125 °F, metal film
Trimpot, intermittent, -25 °F
Diode, open, 125 °F, 1N483
Selector switch, short, 77 °F, SA60A
Transistor, intermittent, 125 °F, Mo 90
Transistor, short, 125 °F, 2N1016B
Transistor, short, 77 °F, 2N389
Diode, short, 77 °F, 1N708A
Diode, short, 77 °F, 1N761

Failure labels on the curve, after storage:

Capacitor, electrolyte leak, wet tantalum
Transistor, short, 2N389
Transistor, tolerance, 2N335

Sample size

Resistors       3930
Capacitors      1545
Diodes           915
Transistors     1080
Transformers     105
Total           7575


Resistors failed in two modes (table 4-1(c)): one intermittent
resistor at low temperatures and one tolerance failure at high 
temperatures. 

Burn-in. — As shown in figure 4-1, after 1600 hr the failure
rate of the 7575 parts dropped by a factor of 7 for the remaining
2900 test hours (3 failures per 2900 hr, failures 12, 13, and 14,
as compared with 11 failures per 1600 hr). This is an example
of what are commonly called burn-in failures. The first 11
failures represent parts that had some defect not detected by the
normal part screening or acceptance tests. Such defects do not
reveal themselves until the part has been subjected to operation
for some time. As mentioned earlier, eliminating the repetitive
failure would only decrease the failure rate in the first 1600 hr
by about 18 percent, but if screening tests were sensitive
enough to detect all defects, the failure rate would approach the
intrinsic failure rate shown in figure 4-1 right from the start.
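
The arithmetic behind these statements can be sketched as follows, using only the counts quoted in the text (11 failures in the first 1600 hr, 3 failures in the next 2900 hr, and 2 of the first 11 failures being repetitive). The function and variable names are hypothetical.

```python
def hours_per_failure(hours: float, failures: int) -> float:
    """Observed mean operating hours between failures for a test interval."""
    if failures <= 0:
        raise ValueError("at least one failure is needed")
    return hours / failures


early_hours, early_failures = 1600, 11   # burn-in interval
late_hours, late_failures = 2900, 3      # remainder of the operating test

early_rate = early_failures / early_hours   # failures per hour
late_rate = late_failures / late_hours      # failures per hour
rate_drop = early_rate / late_rate          # factor by which the rate fell

# Share of the early failures removed by fixing the one repetitive failure type.
repetitive_share = 2 / early_failures

print(f"early: one failure per {hours_per_failure(early_hours, early_failures):.0f} hr")
print(f"late:  one failure per {hours_per_failure(late_hours, late_failures):.0f} hr")
print(f"rate dropped by a factor of about {rate_drop:.1f}")
print(f"repetitive failures: {repetitive_share:.0%} of early failures")
```

The drop works out to a little under 7, which the text rounds to a factor of 7.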

In summary, some of the observed properties of operating 
failure rates are as follows: 

(1) For complex equipment, the intrinsic failure rate of 
electronic parts is usually constant in time. 

(2) Failures are random, with repetitive failures represent- 
ing only a small portion of the problems. 

(3) Failure modes of parts and equipment vary, depending 
on the operating environment. 


(4) Most parts have a dominant failure mode. For example, 
the dominant failure mode for semiconductors is shorting. 

(5) Rigid part screening and acceptance criteria can sub- 
stantially reduce operating failure rates by eliminating 
early failures. 

Storage Test 

After the operating test, the parts were put in storage for 
approximately 7000 hr (10 months) and then were retested to 
determine the effect of storage on parts. As shown in fig-
ure 4-1, three failures (14, 15, and 16) were observed at the end
of the storage period. Note that the average failure rate observed 
in storage (one failure per 2300 hr) is close to the same rate 
observed in the previous 2900 hr of operation. Thus, it can be 
concluded that storage does produce part failures and that the 
storage failure rate may be as high as the operating rate. Industry 
is conducting a great deal of research on this problem because 
storage failure rates become a significant factor in the reliability 
of unmanned systems and affect considerably the maintenance 
policy of manned systems. 

Summary of Variables Affecting Failure Rates 

Part failure rates are thus affected by

TABLE 4-1. — FAILURE MODES
(a) Transistors

Observed part    Temperature, °F                            Total      Observed failure
failure mode     -25            77             125          failures   rate, failures/hr

Open             --             --             MD-90        1          0.206/10^6
Short            MD-90, 2N498   2N389, 2N396   2N1016B      5          1.03/10^6
Intermittent     --             --             MD-90        1          0.206/10^6
Leakage          2N1057         --             --           1          0.206/10^6
Totals           3              2              3            8          1.65/10^6

TABLE 4-3. — STRESS RATIOS THAT MEET ALLOCATION REQUIREMENT

Part            Stress ratio, W
temperature,    0.1     0.2     0.3     0.4     0.5     0.6
°C              Failure rate of derated part per 10^6 hr

30                                              0.23    0.22
40                                      0.24
50                              0.24
60                      0.25
70              0.25

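TABLE 4-1. — Concluded.
(b) Diodes

Observed part    Temperature, °F                            Total      Observed failure
failure mode     -25     77                       125       failures   rate, failures/hr

Open             --      --                       1N483     1          0.24/10^6
Short            --      1N761, 1N708A, SA60A     --        3          0.73/10^6
Totals           0       3                        1         4          0.97/10^6

(c) Resistors

Observed part    Temperature, °F                            Total      Observed failure
failure mode     -25       77      125                      failures   rate, failures/hr

Intermittent     Trimpot   --      --                       1          0.06/10^6
Tolerance        --        --      Metal film               1          0.06/10^6
Totals           1         0       1                        2          0.12/10^6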

TABLE 4-2. — FAILURE RATE CALCULATION

Columns: component; stress ratio, W; number used, N; failure rate of
derated part at 40 °C, λ_D, failures/10^6 hr; application factor for
vehicle, ground mounted, K_A; total failure rate, λ_T = N λ_D K_A,
failures/10^6 hr.

(a) Tactical fire control station logic gate

Component                            W       N    λ_D       K_A    λ_T

Resistor, composition (2000 Ω)       0.5     1    0.0035    10     0.035
Resistor, composition (180 000 Ω)    0.5     1    0.0035    10     0.035
Resistor, composition (22 000 Ω)     0.6     1    0.0038    10     0.038
Resistor, composition (6500 Ω)       0.5     2    0.0035    10     0.070
Transistor, germanium (PNP type)     (a)     1    1.3       8      10.400
Diode, 1N31A                         0.3     1    3.5       5      17.500

(a) <1 W; 0.4 normalized junction temperature.

Total, λ_t = Σλ_T = 29.68

(b) Proposed logic gate

Component                            W       N    λ_D       K_A    λ_T

Resistor, film (1300 Ω)              0.8     1    0.19      0.3    0.057
Resistor, film (3320 Ω)              0.2     1    0.14      0.3    0.042
Resistor, film (46 600 Ω)            0.2     1    0.14      0.3    0.042
Transistor, silicon (NPN type)       (b)     1    0.165     8      1.320
Diode, 1N31A                         0.2     5    3.0       5      75.000

(b) <1 W; 0.15 normalized junction temperature.

Total, λ_t = Σλ_T = 76.461

(1) Acceptance criteria

(2) All environments

(3) Application

(4) Age or storage

To find ways of reducing the occurrence of part failures, we 
observe failure modes, learn what caused the failure (the failure 
stress), determine why it failed (the failure mechanism), and 
then take action to eliminate the failure. For example, one of 
the failure modes observed during the storage test was an 
“open” in a wet tantalum capacitor. The failure mechanism was 
deterioration of the end seals, which allowed the electrolyte to 
leak. One obvious way to avoid this failure mode in a system 
that must be stored for long periods without maintenance is not 
to use wet tantalum capacitors. If this is impossible, the next 
best thing would be to redesign the end seals. This would no 
doubt require further testing to isolate the exact failure stress 
that produces the failure mechanism. Once isolated, the failure 
mechanism can often be eliminated through redesign or addi- 
tional process controls. 

One of the best known methods of representing part failures 
is the use of failure rate data. Figure 4-2 (from ref. 4-1) shows
a typical time-versus-failure-rate curve for flight hardware.
This is the well-known “bathtub curve,” which over the years
has become widely accepted by the reliability community and 
has proven to be particularly appropriate for electronic equip- 
ment and systems. It displays the sum of three failure rate 
quantities: quality (QFR), stress (SFR), and wearout (WFR). 

Zone I, the infant mortality period, is characterized by an
initially high failure rate (QFR). This is normally the result
of poor design, use of substandard components, or lack of
adequate controls in the manufacturing process. When these
mistakes are not caught by quality control operations, an early
failure is likely to result. Early failures can be eliminated by a
“burn-in” period during which the equipment is operated
at stress levels closely approximating the intended actual oper-
ating conditions. The equipment is then released for actual use
only when it has successfully passed through the burn-in
period. For most well-described complex equipment, a 100-hr
failure-free burn-in is usually adequate to cull out a large
proportion of the infant mortality failures caused by stresses
on the parts.

Zone II, the useful life period, is characterized by an essen-
tially constant failure rate (SFR). This is the period dominated
by chance failures, defined as those failures that result from
strictly random or chance causes. They cannot be eliminated by
either lengthy burn-in periods or good preventive maintenance
practices.

Equipment is designed to operate under certain conditions 
and to have certain strength levels. When these strength levels 
are exceeded because of random unforeseen or unknown events, 
a chance failure will occur. Although reliability theory and 
practice are concerned with all three types of failure, the 
primary concern is with chance failures since they occur during 
the useful life of the equipment. Figure 4-2 is somewhat
deceiving because zone II is usually much longer than zone I
or III. The time when a chance failure will occur cannot be
predicted, but the likelihood or probability that one will occur
during a given period of time within the useful life can be
determined by analyzing the equipment design. If the probabil-
ity of a chance failure is too great, either design changes must
be introduced or the operating environment made less severe.

Figure 4-2. — Hazard rate versus equipment life periods.

The SFR period is the basis for the application of most 
reliability engineering design methods. Because it is constant, 
the exponential distribution of time to failure is applicable and 
is the basis for the design and prediction procedures spelled out 
in documents such as MIL-HDBK-217E (ref. 4-2). 
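As a rough illustration of how a constant failure rate feeds an exponential reliability estimate, the sketch below sums part failure rates in the style of table 4-2 (number used times derated rate times application factor) and then evaluates R(t) = exp(-λt). The part values are as read from table 4-2(b); the 1000-hr mission time and the function names are assumptions chosen only for the example, and this is not the MIL-HDBK-217E procedure itself.

```python
import math

# (component, number used N, derated rate per 10^6 hr, application factor K_A)
# Values as read from table 4-2(b), proposed logic gate.
PROPOSED_GATE = [
    ("Resistor, film (1300 ohm)", 1, 0.19, 0.3),
    ("Resistor, film (3320 ohm)", 1, 0.14, 0.3),
    ("Resistor, film (46 600 ohm)", 1, 0.14, 0.3),
    ("Transistor, silicon (NPN type)", 1, 0.165, 8),
    ("Diode, 1N31A", 5, 3.0, 5),
]


def total_failure_rate(parts) -> float:
    """Sum N * lambda_D * K_A over all parts (failures per 10^6 hr)."""
    return sum(n * lam_d * k_a for _, n, lam_d, k_a in parts)


def reliability(rate_per_million_hours: float, hours: float) -> float:
    """Probability of no chance failure in `hours` at a constant failure rate."""
    return math.exp(-rate_per_million_hours * 1e-6 * hours)


if __name__ == "__main__":
    lam = total_failure_rate(PROPOSED_GATE)
    mission_hours = 1000.0  # hypothetical mission length
    print(f"total failure rate: {lam:.3f} per 10^6 hr")
    print(f"mean time between failures: {1e6 / lam:.0f} hr")
    print(f"R({mission_hours:.0f} hr) = {reliability(lam, mission_hours):.4f}")
```

The summed rate reproduces the 76.461 failures per 10^6 hr total shown in the table; the reliability figure depends entirely on the assumed mission time.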

The simplicity of the approach (utilizing the exponential 
distribution, as previously indicated) makes it extremely attrac- 
tive. Fortunately, it is widely applicable for complex equipment 
and systems. If complex equipment consists of many compo- 
nents, each having a different mean life and variance that are 
randomly distributed, then the system malfunction rate becomes 
essentially constant as failed parts are replaced. Thus, even 
though the failures might be wearout failures, the mixed popu- 
lation causes them to occur at random intervals with a constant 
failure rate and exponential behavior. This has been verified for 
much equipment from electronic systems to rocket motors. 
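The claim that a mixed population of wearout-type parts produces a roughly constant system failure rate can be explored with a small simulation. Everything numeric below is hypothetical: 200 part positions, mean lives drawn uniformly between 500 and 1500 hr, a 20 percent standard deviation, and immediate replacement on failure. It is a sketch of the idea and does not reproduce any data from the source.

```python
import random


def simulate_replacements(n_parts=200, horizon=20000.0, window=1000.0, seed=1):
    """Count system failures per time window for parts with mixed wearout lives.

    Returns (counts, expected_per_window), where the expectation is the
    long-run renewal rate: window * sum(1 / mean_life) over all positions.
    """
    rng = random.Random(seed)
    counts = [0] * int(horizon // window)
    expected_per_window = 0.0
    for _ in range(n_parts):
        mean_life = rng.uniform(500.0, 1500.0)  # hypothetical mean life, hr
        sigma = 0.2 * mean_life                 # narrow, wearout-like spread
        expected_per_window += window / mean_life
        t = 0.0
        while True:
            # Life of the part currently installed (kept positive).
            t += max(1.0, rng.gauss(mean_life, sigma))
            if t >= horizon:
                break
            counts[int(t // window)] += 1       # failed part replaced at once
    return counts, expected_per_window


if __name__ == "__main__":
    counts, expected = simulate_replacements()
    print("expected failures per window:", round(expected, 1))
    print("observed failures per window:", counts)
```

After the initial transient, the per-window counts should settle near the expected value even though each individual part fails by wearout, which is the behavior the paragraph describes.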

Zone III, the wearout period, is characterized by an increas- 
ing failure rate (WFR) resulting from equipment deterioration 
due to age or use. For example, mechanical components, such 
as transmission bearings, will eventually wear out and fail 
regardless of how well they are made. Early failures can be 
postponed and the useful life extended by good design and 
maintenance practices. The only way to prevent failure due to 
wearout is to replace or repair the deteriorating component 
before it fails. 

Because modern electronic equipment is almost completely
composed of semiconductor devices that really have no short- 
term wearout mechanism, except for perhaps electromigration, 
one might question whether predominantly electronic equip- 
ment will even reach zone III of the bathtub curve. 

Different statistical distributions might be used to character- 
ize each zone. Hazard rate has been defined for five different 
failure distribution functions. Depending on which distribution
fits the hazard rate data best, each zone can be represented by
a different distribution, for example, the useful life period by
the exponential distribution and the wearout period by the log
normal distribution.

Part Failure Rate Data 

It is common in the field [...] integrity [...] mean time
between failures (MTBF). In general, part failure rates are
presented as a function of [...] electrical stress as [...]
figure [...]. The family of curves on the graph represents
different applied electrical stresses in terms of a stress ratio
or derating factor. For example, if a part is to operate at
temperature A and is derated [...] (stress ratio 0.8), that part
will have a failure rate of [...] hr, although as indicated in
chapter 3, other dimensions are used depending on who
publishes the data.

The current authoritative failure rate data published by the
Department of Defense are in MIL-HDBK-217E (ref. 4-2).
The MIL-HDBK-217 series [...] AGREE effort mentioned
[...]. The publications listed in table 1-1 and in references
4-3 to 4-5 are [...] based [...] this effort to meet the need
for [...] part failure rates. Because [...] existing and new
state-of-the-art parts are [...] being generated and analyzed,
failure rate handbooks do change. Therefore, be sure to [...].

[...] failure rates are statistical, and there [...] simple
definition of failure rate:

\[ \lambda = \frac{\text{Number of observed failures}}{\text{Total operating time}} \]

Obviously, if today we observe two failures [...] and
tomorrow we [...], the failure rate is two failures in 124 hr.
Then, if a failure occurs in the next 1-hr period, the failure
rate is three failures in 125 hr. Therefore, we [...] failure
rate is, but we can [...].
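The way a failure rate estimate shifts as observations accumulate can be shown with the simple definition (observed failures divided by total operating time). This is a minimal sketch; the two data points (two failures in 124 hr, then three failures in 125 hr) are the ones legible in the passage above, and the function name is illustrative.

```python
def failure_rate(failures: int, operating_hours: float) -> float:
    """Point estimate: observed failures divided by total operating time."""
    if operating_hours <= 0:
        raise ValueError("operating time must be positive")
    return failures / operating_hours


if __name__ == "__main__":
    before = failure_rate(2, 124.0)  # two failures in 124 hr
    after = failure_rate(3, 125.0)   # one more failure in the next hour
    print(f"before: {before:.4f} failures/hr")
    print(f"after:  {after:.4f} failures/hr")
    print(f"change: {100 * (after / before - 1):.0f} percent")
```

A single additional failure moves the estimate by roughly half, which illustrates why a computed failure rate is a statistical estimate and not a fixed property of the part.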

Improving System Reliability Through 
Part Derating 

The best way to explain how [...] is with an example.
Consider two 20-V wet [...] tantalum capacitors, both [...].
One is operated at 20 V and the other at 12 V. First, find the
[...] operating-to-rated ratio for both applications:

\[ \text{Stress ratio} = \frac{\text{Operating voltage}}{\text{Rated voltage}} \]

Hence, one capacitor has a stress ratio of 1.0,

\[ \text{Stress ratio} = \frac{20\ \text{V}}{20\ \text{V}} = 1.0 \]

and the other, a stress ratio of 0.6,

\[ \text{Stress ratio} = \frac{12\ \text{V}}{20\ \text{V}} = 0.6 \]

[...] 0.6 is the same as “derating” the component [...]
for MIL-C-3965 [...]
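The stress ratio calculation for the two capacitors can be written out directly. In this sketch the function names are illustrative, and expressing derating as (1 - stress ratio) in percent is a common convention assumed here, not a definition taken from the source.

```python
def stress_ratio(operating_value: float, rated_value: float) -> float:
    """Ratio of the applied stress to the rated stress (same units)."""
    if rated_value <= 0:
        raise ValueError("rated value must be positive")
    return operating_value / rated_value


def derating_percent(ratio: float) -> float:
    """Assumed convention: percent by which the part is run below its rating."""
    return (1.0 - ratio) * 100.0


if __name__ == "__main__":
    for operating_volts in (20.0, 12.0):
        ratio = stress_ratio(operating_volts, rated_value=20.0)
        print(f"{operating_volts:.0f} V on a 20-V part: "
              f"stress ratio {ratio:.1f}, derated {derating_percent(ratio):.0f} percent")
```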
Importance of Learning From Each
Failure

When a product fails, a valuable piece of information about
it has been generated because we have the opportunity to learn
[...] the product if we take the right actions.

Failures can be classified as

(1) Catastrophic (a shorted transistor or an open wire-wound
[...])

(2) Degradation (change in transistor gain or resistor value)

(3) Physical wearout (brush wear in an electric motor)

These three failure categories can be subclassed further:

(1) Statistically independent (a shorted capacitor in a radio-
frequency amplifier being unrelated to a low-emission
[...])

(2) Cascade (the shorted capacitor in the radio-frequency
amplifier causing excessive current to [...] transis-
tor and burning the collector beam lead open)

(3) Common mode (a circuit board used for primary control
of a process and a backup circuit board both burned out
by an over-voltage condition in a power supply that
feeds the two of them)

On the basis of the following categories, much can be learned
from each failure that occurs during flight acceptance testing
for a mission: good failure reporting, conducting failure analy-
ses, maintaining a concurrence system, and taking corrective
action. Failure analysis [...] deals with [...]. Concurrence
[...]. Corrective action ensures that [...] taken to avoid
[...] part ratings with the use stresses and to verify that parts
are being used with a known margin.

Failure Reporting, Analysis, Corrective 
Action, and Concurrence 

[...] needs. Keep your form simple and easy to fill out, and
get approval from management.


Case Study— Achieving Launch Vehicle 
Reliability 

Design Challenge 

The launch vehicle studied requires the highest acceleration
and velocity and the shortest reaction time of any developed. As
such, the design challenges were formidable [...]
environments include random vibration of 61 g's rms up [...]
3 kHz, mechanical shock at 25 000 g's peak (between 5 and
[...] kHz), linear acceleration well in excess of [...]
of 150 dB, and aerodynamic heating up [...]. [...]
development philosophy was that a vehicle be launched from a
[...] silo with the initial design. Although many [...]
occurred during the 13-year development, the first flight test
vehicle was not greatly different from the 70 now deployed.

Subsystem Description 


The vehicle is launched from an underground silo, which also
serves as a storage container during the multiyear design life.
[...] maintains the silo environment at 80±10 °F and 50 percent or
[...].

The vehicle is predominantly in a power-off storage mode
when deployed in its silo. A periodic test of flight electronics is
conducted automatically every 4 weeks. In a [...]
life the flight electronics accumulate about [...] of operating
time and 43 830 hr of storage time. The ratio of storage time to
operating time is nearly 240 000:1.

Approach to Achieving Reliability Goals 

Reliability mathematical models were developed early in the
research and development program. From these models it was
apparent that the following parameters were the most important
in achieving the reliability goals:

(1) Electronic storage failure rate during a multiyear design
life (i.e., storage failures)

(2) Percent testability of missile electronics (i.e., MIL-
STD-471A, ref. 4-6)

(3) Periodic test interval for missile electronics

(4) Severity of in-flight environments (acceleration, shock,
[...] aerodynamic heating)
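To see how the first three parameters interact, the sketch below uses a deliberately simple model that is not taken from the source: a constant storage failure rate, a testable fraction of failures that is found and repaired at each periodic test, and an untestable fraction that accumulates undetected over the design life. The 20 failures per 10^6 hr rate and the 90 percent testability are hypothetical; the 4-week (672-hr) interval and 43 830 hr of storage come from the case study.

```python
import math


def mean_survival(rate: float, duration: float) -> float:
    """Time-average of exp(-rate * t) for t between 0 and duration."""
    x = rate * duration
    if x == 0:
        return 1.0
    return -math.expm1(-x) / x  # (1 - exp(-x)) / x


def mean_readiness(storage_rate: float, testability: float,
                   test_interval: float, design_life: float) -> float:
    """Average probability that the stored electronics have no failure present.

    Testable failures are assumed to be repaired at every periodic test;
    untestable failures are assumed to stay hidden for the whole design life.
    """
    detectable = mean_survival(testability * storage_rate, test_interval)
    latent = mean_survival((1.0 - testability) * storage_rate, design_life)
    return detectable * latent


if __name__ == "__main__":
    rate = 20e-6  # hypothetical storage failure rate, failures/hr
    for testability in (0.5, 0.9, 0.99):
        r = mean_readiness(rate, testability, test_interval=672.0,
                           design_life=43830.0)
        print(f"testability {testability:.2f}: mean readiness {r:.3f}")
```

In this simplified picture the untestable fraction dominates, which suggests why storage failure rate and percent testability appear at the top of the list.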


National Aeronautics and
Space Administration

Glenn Research Center

PROBLEM REPORT # ____                          Page __ of __

(Hardware __ Software __)

A. Project Name ____          Procedure No. ____     Date Identified ____
   Assy/CSCI Name ____        ID No. ____            Location ____
   Type: 1. Eng/Qual  2. Flight  3. GSE
   Process: 1. Inspect  2. Assemble  3. Design  4. Code  5. Test - Type: ____

B. Background Info. & Descriptions: (use continuation sheets as needed)

   Initiator ____  Date ____

C. Analysis/Root Cause/Effect on System (use continuation sheets as needed)

   Is damage assessment required? Yes __ (Is work sheet attached? Yes __)
   Defect(s) info. (Name, ID, Lot code, Supplier, affected routines/sub-routines/programs, etc.)
   Defect Code: ____
   Problem Type: Nonconformance __ Failure __       Analyst ____  Date ____

D. Disposition: Rework/Rewrite __ Repair/Patch __ Use as is __ Return __ Scrap __ Request Waiver __

E. Corrective Action: (use continuation sheets as needed)

   Initiated: Eng Chg Order __, Software Chg Req __, Waiver Req __, Request/Order # ____
   Project Eng ____  OMS&A ____  Reviewed on: __/__/__

F. Corrective Action Follow-up: __/__/__  By (name & title): ____

G. Project Office Approval Signature(s) & Date
   OMS&A Approval Signature(s) & Date

NASA-C-8192 (Rev. 4-97)  Page 1 of 2

Distribution: Project Mgr. (orig.), OMS&A, Hardware File

(Ref. PA1# 440)

(a)

Figure 4-6. — Failure report and analysis forms. (a) Problem report. (b) Damage assessment worksheet. (c) Defect codes.
INSTRUCTIONS (Please print/write legibly)


Problem Report # — Unique number assigned by OMS&A PRACA Administrator. 

(Hardware__ Software ) — Analyst select 1 of 2 categories. 

Section A — To be completed by person who discovered the problem 
Project Name — Name or Acronym of project. 

Procedure No. — Title and/or No. of procedure/instructions used to carry out required task. 

Date Identified — Date when nonconformance is found or failure occurred. 

Assy/CSCI Name — Name of specific pkg., assy., sub-assy, or software pkg. with problem. 

ID No. — Part No., Serial No. if there are multiple parts of same design, or SCM# (SW Config. Mgnt. #). 

Location — Location where problem is identified, e.g. GRC, KSC, EMI Lab, Machine Shop, etc. 

Type — Choose 1 of 3 choices “Engineering/Qualification, Flight or Ground Support Equip." 

Process — Choose 1 of 5 choices “Inspection, Assembly, Design, Code or Test". 

Test Type — Applies to test processes only, e.g., Burn-in, Vib., Thermal Cycle, Integration, Acceptance, etc.

Section B — To be completed by person who discovered the problem 

Background Info & Descriptions How much operating time/cycles did the package have when the problem occurred? Record what 
was actually measured (actual data), and what it should have been (specifications); and which computer or micro, was running the 
software? 

Initiator — Name of person who initiates report. Date — Report date.

Section C — To be completed by responsible Project Engineer/Analyst 

Analysis/Root Cause/Effect on System — Brief summary of analysis, describe root cause(s), and effect on system if root cause(s) is 
not eliminated. 

Defective Part(s) Info. — Record defective part(s) name, identification (P/N & S/N), model, lot code, supplier/manufacturer. 

Problem Type — Choose 1 of 2 choices “Nonconformance or Failure". 

Analyst — Name of analyst. Date — Analysis complete date. 

Section D — Responsible Project Engineer(s) will choose 1 of 6 disposition choices. 

Rework/Rewrite Correct hardware/software to conform to requirements (dwgs., specs., procedures, etc.). 

Repair/Patch — Modify hardware or patch software programs to usable condition. 

Use as is Accept hardware/software as is, without any modifications; or “Work around” - Software remains as is, but further action is 
required on operator or other systems. 

Return — Return to supplier for corrective action (rework, repair, replace, analysis, etc.). 

Scrap — Isolate defective material for detailed analysis, or discard unusable material.

Request Waiver — Initiate a Waiver Request for authorization to vary from specified requirements. 

Section E — Joint effort of Project Engr., OMS&A Rep. & Specialist(s) as needed 

Corrective Action Record specific actions required to eliminate problem(s) and prevent recurrence. Identify extent of software 
regression testing, and affected routines/programs, including any ECO# (Eng Chg Order), SCO# (Software Chg Request), and 
Waiver Request# initiated. 

Project Eng. — Responsible project engineer’s signature. 

OMS&A — Cognizant OMS&A representative’s signature. 

Reviewed on — Date when Corrective Action plan is reviewed, or Problem Review Board meets. 

Section F — To be completed by OMS&A Representative 

Corrective Action Follow-up Date when corrective action is verified. Assure approved waiver is attached if one has been 
requested. This will be the official “Problem Closure Date”. 

Verified By — Name of OMS&A Rep. who completed the follow-up. 

Section G — Approval Signature Requirements 

Problem identified during assembly/inspection — Required sign-off by Project Eng. & OMS&A Rep. 

Problem identified during test — Required signatures of Project Engineer, OMS&A Rep., Project Assurance Manager, and Project 
Manager. 


** Training on PRACA System is available through Assurance Management Office** 

NASA-C-8192 (Rev 4-97) Page 2 of 2 

(a) 


Figure 4.6. — (a) Concluded. 


Figure 4-6. — (b) Damage assessment worksheet. 




ATTACHMENT 3.2.7

DEFECT CODES

INITIAL DEFECT (ONLY) CODES

TEST FAILURE                                CODE
Component Select (Separate Test)            101
Combined (POT/COAT)                         103
POST (POT/COAT)                             105
Performance/Functional                      106
Shock                                       107
Thermal Cycle                               109
EMI/EMC                                     111
Burn-In                                     113
Pre (POT/COAT)                              115
Vibration                                   117
Thermal Vacuum                              119
X-Ray Examination Reject                    120
Launch Site Test (Ground Equipment)         122
Acoustics                                   123
Continuity/Ground                           125
Launch Site Test (Airborne Equipment)       126
Engine Leaks                                127
Leak Test                                   128
Model Survey                                131
Structural Load                             132
Thermal Balance                             133
Pressurization                              134
Proof Pressure                              135
Appendage Deployment                        136
Phasing Test                                137
Alignment Test                              138
Weight and CG                               139

SUSPECT                                     CODE
NOTE: Temporary code must be changed before final closeout.
Suspect                                     750
Suspect as a result of DC&R activity        760

FINAL DEFECT CODES

CONTAMINATION                               CODE
Fluid                                       251
Biological                                  252
Corrosion                                   254
Particulate                                 255
Foreign Object                              256
Contaminated                                1006

ELECTRICAL                                  CODE
Incorrect Mounting                          261
Connector Damaged                           262
Incorrect Lead Bend                         263
Unqualified Part                            264
Short Lead                                  265
Damaged Component                           266
Long Lead                                   267
Burnt/Discolored                            268
Lead/Wire Damaged                           269
Wire Size Incorrect                         270
Birdcaged                                   271
Crimping Incorrect                          272
Insulation Damaged                          273
Missing Part                                274
Polarity Incorrect                          275
Dirty Relay Contacts                        276
Routing Incorrect                           277
Miswired                                    278
Other                                       279
Wrong Part                                  280
Incorrect Reference Designators             2028

MECHANICAL                                  CODE
Incorrect Part                              281
Binding Stuck or Jammed                     282
Dissimilar Metals                           283
Excess Bonding                              284
Holes Incorrect                             285
Lack of Proper Lubrication                  286
Insufficient Bonding                        288
Interference                                289
Bent Contacts/Pins                          290
Misaligned                                  291
Missing Part                                292
Improper Assembly                           293
Safety Wire Items                           294
Weight                                      295
Torque Values Incorrect                     296
Part Damaged                                298
Does Not Engage or Lock Correctly           299
Incorrect Dimensions                        2001

(c)

Figure 4-6. — (c) Defect codes.
DEFECT CODES (continued)

FINAL DEFECT CODES (continued)

MECHANICAL (continued)                      CODE
Location                                    2002
Missing or Extra                            2003
Insert                                      2004
Rework/Repair Damages                       2025
Detail Out of Tolerance                     6001
Layout                                      6002
Bend Radius/Angle                           6003
Made in Reverse                             6004
Undersize Machine/Grind                     6005
Incorrect Loft Lines Used                   6006

DAMAGE                                      CODE
Packaging/Handling                          301
Launch                                      303
During Fabrication                          305
During Usage                                306
During Transportation                       307
During Test                                 308
Damage                                      1009
Damaged PWB                                 2046

DIMENSIONAL                                 CODE
Inside Dimension Distorted                  401
Incorrect Length                            402
Inside Dimension Undersize                  403
Incomplete-Missing                          404
Outside Dimension Distorted                 405
Mislocated Feature                          406
Outside Dimension Oversize                  407
Surface Finish                              408
Thickness Oversize                          409
Outside Dimension Undersize                 410
Thickness Undersize                         411
Incorrect Width                             412
Inside Dimension Oversize                   413
Inside Diameter Undersize                   416
Inside Diameter Oversize                    417
Outside Diameter Undersize                  418
Outside Diameter Oversize                   419
Flatness                                    420
Straightness                                421
Roundness                                   422
Cylindricity                                423
Perpendicularity                            424
Angularity                                  425
Parallelism                                 426
Profile                                     427
Runout-Total Runout                         428
True Position                               429
Burrs-Sharp Edges                           431
Threads                                     432
Angle                                       433
Depth                                       434

DOCUMENTATION                               CODE
Other Documentation                         450
Test Reports/Certs in Error/Not Complete    452
Test Reports/Certs Not Received             453
Missing/Lost MARS                           455
MARS in Error                               456
Missing/Lost Process Plan                   457
Incorrect Entry Process Plan                458
Process Plan Not To Latest DCN              459
Q Codes (Other than Test Reports/Certs)     470

PLASTICS                                    CODE
Improper Cure/Mix                           475
Delamination                                476
Discontinuities (Holes/Blisters/Voids)      477
Fiber Content                               478
Flexural                                    479
Lap Shear                                   480
Exposed Circuitry                           482
Incorrect Coating                           484
Incorrect Bonding                           485

FINISH                                      CODE
Adhesion                                    501
Blistered/Flaking                           502
Color                                       503
Cracked/Crazed                              504
Incorrect                                   505
Pitted/Porous                               506
No Samples                                  507
Rough/Irregular                             508
Thickness                                   509
Scratched                                   510

IDENTIFICATION                              CODE
Incomplete                                  551
Incorrect                                   552
Smeared/Illegible                           554
Missing                                     556

MATERIALS PROPERTIES                        CODE
Chemical                                    611
Metallurgical                               612
Improper Mix/Cure                           613

(c)

Figure 4-6. — (c) Continued.
DEFECT CODES (continued)

FINAL DEFECT CODES (continued)

MATERIAL PROPERTIES (continued)             CODE
Heat Treat Material Response                615
Mechanical                                  617
Voids/Inclusions                            618
Crack/Fracture                              619
Voids/Porosity/Inclusions/Cracks            6013
Certification                               6014
Incorrect Material                          6015
Incorrect Dimensions                        6016
Chemical Composition                        6017
Moisture Content                            6018
Pot Life                                    6019
Tensile Strength                            6020
Yield                                       6021
Hardness                                    6022
Cure Hardness                               6023
Peel Strength                               6024

SOLDER                                      CODE
Cold Joint                                  702
Hole in Solder                              703
Fractured Joint                             704
Pitted/Porous                               705
Insufficient                                706
Excess Flux                                 707
Excess Solder                               708
Solder/Ball Splash                          709
Dewetted Joint                              710
Lifted Pads                                 712
Measling                                    713
Insulation in Solder                        721
Potential Short                             722
Bridging                                    723
Improper Tinning                            724
Manual Soldering Discrepancy                2047
Machine Soldering Discrepancy               2048
Contaminated Joint                          2049
Corrosion/Oxidation                         2050

NO DEFECT                                   CODE
NOTE: This code to be used for MARS         755
  closures where no discrepancies
  were identified.

MISCELLANEOUS                               CODE
Test Error                                  860
Process/Steps Missed                        861
No Evidence of Source Inspection            862
Procurement Error                           863
Destructive Physical Analysis (DPA)         864
  Reject
Particle Impact Noise Detection             865
(PIND) Reject                               870
Defective Tool/Test Equipment               872
Incorrect Assembling                        875
Integrity Seal Missing or Broken            877
Intermittent Operations                     879
Launch Usage                                881
Leakage                                     883
Out of Calibration                          885
Shipped Short                               886
Burst/Ruptured                              887
Failed Due to Associated Equipment          888
Expanded (Normal Life)                      895
Time/Operational, Temperature
  Sensitive, Expirations                    1006
Procedure Not Followed                      2026
Proof Test                                  2027
Missing Operation                           6025
Contamination                               6026
All Trailer Problems                        6027
Documentation/Certification Problems        6028
History Jacket Problems                     6029
Directed Rejection Item

WELDS                                       CODE
Cracks                                      902
Porosity                                    903
Lack of Fusion                              904
Burn Through                                905
Lack of Penetration                         906
Laps                                        907
Mismatch/Suck-In                            910
Location                                    912
Build Up                                    921
Craters                                     922
Discoloration                               923
Fill-Up                                     924
Length                                      925
Preparation                                 926
Profile                                     927
Undercut                                    928
Oxidation                                   2063
Metal Expulsion                             2064

(c)

Figure 4-6. — (c) Continued.
DEFECT CODES (continued)

FINAL DEFECT CODES (continued)

ELECTRONIC/COMPUTERS                        CODE
Faulty Program or Disk                      931
Unable to Load Program                      932
Nonprogrammed Halt                          933
Illegal Operation or Address                934
Computer Memory Error/Defect                935
Input/Output Pulse Distortion               936
Low Power Output                            937
Frequency Out of Band, Unstable             938
  or Incorrect
Commercial Part Failure                     941
Communication/Transmission Line             943
  Disturbance
Externally Induced Transient                945

COMPONENT LEAD WELDING (EMF only)           CODE
Excessive Embedment                         950
Cracks                                      951
Voids                                       952
Excessive Expulsion                         953
Open/Missed Welds                           954
Damaged Ribbon/Lead                         955
Dimensions Incorrect                        956
Sleeving Missing                            957
Insufficient Heat/Cold Weld                 958
Misrouted                                   959
Insufficient Fillet                         960
Ribbon/Lead Misalignment                    961
Ribbon/Lead Length Incorrect                962

ASSEMBLY/INSTALLATIONS                      CODE
Parts Mismatched                            2019
Fastener Wrong or Damaged                   2020
Damaged or Missing Seals                    2021
Missing/Improperly Installed                2022
Parts Missing/Wrong/Damaged                 2023
Improper Configuration                      2024

RESISTANCE WELDING                          CODE
Resistance Weld Defects                     2067

BONDING/COMPOSITES/POTTING                  CODE
Separation/Delamination                     2013
Improper Cure                               2014
Incorrect Lay-Up/Set-Up                     2015
Test Specimen Failure Missing               2016
Voids/Blisters/Bridging/Pits                2017
Damage                                      2018
Mission Operation                           2051
Damaged                                     2052

CONNECTORS-COMPONENTS/EEE                   CODE
Exceeds PDA                                 2041
Outside of SPC Boundaries                   2042
X-ray to Applicable MIL Spec                2043
Improper Testing                            2044
Noisy Output                                2045

TOOLING FUNCTION                            CODE
Incomplete Hardware                         6007
Burrs                                       6006
Inadequate Structure                        6009
Discrepant Drill Bushing                    6010
Improper Insert/Bushing                     6012

FUSION WELDING                              CODE
Fusion Weld Defects                         2066

TUBE/HOSE                                   CODE
Damaged Flares/Lip Seals                    2005
Incorrect Contours/Bends                    2006
Wrong or Binding B-Nuts Sleeves             2007
Dimensional                                 2008
Expended                                    2009
Damaged Braid                               2010
Cracks                                      2011

CHEMICAL/PLATING/LUBE/PAINT                 CODE
Contamination                               2012

Figure 4-6. — (c) Concluded.
Launch and Flight Reliability

The flight test program demonstrated the launch and flight 
reliability of the vehicle. The ultimate flight program success 
ratio of 91 percent exceeded the overall availability-reliability 
goal by a comfortable margin. 


Field Failure Problem 

Twenty-six guidance sections failed the platform caging test 
portion of the launch station periodic tests (LSPT’s). These 
failures resulted in a major alarm powerdown. An investigation 
was conducted. 

Description of launch station periodic tests. — The system
test requirements at the site include a requirement for station 
periodic tests upon completion of cell or vehicle installation 
and every 28 days thereafter. LSPT’s check the overall system 
performance to evaluate the readiness of a cell. During an 
LSPT, the software initiates a test of the vehicle and ground 
equipment data processing system, and radar interfaces. Any 
nonconformance during an LSPT is logged by the data proces- 
sor and printed out, and the time from initiation of LSPT to 
failure is recorded. During an LSPT, the platform spin motor is 
spun up and held at speed for approximately 10 sec. After this, 
the system is returned to normal. 

An LSPT consists of two phases: 

(1) Spinup, a power-up phase to spin the gyros, align the 
platform, verify platform null, and check airborne 
power supply operation 

(2) A detailed test of airborne electronics in the radio- 
frequency test phase 

Initial failure occurrence. — Cell 3 on remote farm 1 (R1C3)
experienced an LSPT failure (a major alarm powerdown)
5.936 sec after “prep order,” the command to ready the vehicle
for launch. The failure did not repeat during four subsequent
LSPT’s. R1C3 had previously passed three scheduled LSPT’s
before failure. A total of four cells on remote farms 1 and 2
had experienced similar failures. Two of the failures occurred
at 5.360 sec (an inverter test to determine if ac power is avail-
able). Two occurred at 5.936 sec (caging test to determine if the
platform is nulled to the reference position; see fig. 4-7).
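The recorded failure times can be compared against the gate times annotated on figure 4-7. In the sketch below, pairing the annotated times into start and end windows, and naming the first two windows after the inverter and caging tests, are assumptions made for illustration; the figure itself only lists the times, and it notes that gate times are within ±50 ms of those shown.

```python
# Gate windows in milliseconds on the launch sequencer clock.
# Assumption: consecutive annotated times on figure 4-7 form (start, end) pairs.
GATES_MS = {
    "inverter test": (5347.7, 5373.9),
    "caging test": (5924.5, 5950.7),
    "unlabeled gate": (5976.9, 6003.1),
}


def gate_for(failure_time_s: float):
    """Return the name of the gate containing the failure time, or None."""
    t_ms = failure_time_s * 1000.0
    for name, (start, end) in GATES_MS.items():
        if start <= t_ms <= end:
            return name
    return None


if __name__ == "__main__":
    for t in (5.360, 5.936):
        print(f"failure at {t:.3f} s falls in: {gate_for(t)}")
```

Under that pairing, the 5.360-sec failures fall in the first window and the 5.936-sec failures in the second, consistent with the inverter and caging tests named in the text.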

Replacement of failed guidance and control sections (G&C)
28, 102, and 86 led to successful LSPT’s. G&C 99, which failed
only once during in-cell testing, was left on line. G&C’s 28,
102, and 86 were returned to Martin Marietta, Orlando, for
analysis of the [...] failed condition.

Failure verification and troubleshooting.— The test plan
that was generated permitted testing the failed G&C’s in a
horizontal marriage test and a G&C test to maximize the
probability of duplicating the field failures. Test results
confirmed site failures for both the caging null and the inverter
null during a horizontal marriage test on G&C 102, a G&C level
test on G&C’s 28 and 86, and an autopilot level test on G&C 102.
G&C 102 failed caging null four times and inverter null once at
horizontal marriage. An evaluation of the inverter null failure
revealed that a high caging amplifier output caused the launch
sequencer level detector to become offset during inverter
monitoring, resulting in the major alarm even though the
autopilot inverter voltage was normal. Launch sequencer offset may
or may not occur with an uncaged platform depending on the
amplitude of the caging amplifier output when the inverter
voltage is monitored. Therefore, both the inverter null and the
caging null LSPT failures at site were attributed to failure of the
platform to cage.

Figure 4-7.— System spinup tests. Abscissa: launch sequencer
clock, 0 to 6 s, with an expanded scale near system ready and the
remote farm time reference (RFTR). Gate times marked: 5347.7,
5373.9, 5924.5, 5950.7, 5976.9, and 6003.1 ms. (Gate times are
within ±50 ms of that shown because of data processor
tolerances.)

An autopilot acceptance test tool was modified to permit 
monitoring of the platform spin motor voltage (800 Hz, 8 V,
3φ) and the spin motor rotation detector (SMRD). During a
spinup test on autopilot 69 (G&C 102), recordings indicated 
sustained caging oscillation. The SMRD showed no evidence 
of spin motor operation even though all autopilot voltages were 
correct, including the spin motor excitation voltage at the 
platform terminals. Further verification was obtained by listen- 
ing for characteristic motor noises with a stethoscope. 

G&C 86 failed the G&C level test because of caging null and 
inverter null alarms. Then, 3.5 sec into the third run, the caging 
loop stopped oscillating, but the platform did not cage in time 
to pass the test. The next run met all G&C test requirements.
It appeared obvious that the spin motor started spinning in the
middle of the run.

G&C 28 failed one run of the G&C level test; however, it met 
all requirements in the autopilot level test. This means that the 
spin motor successfully met its acceptance test procedure
requirements. A hesitation was noted during two of the seven spinup
tests conducted. Platform 127 was heated to normal on the gyro
test set. Its resistances were checked and found to meet
specification requirements. No attempt was made to start platform
127's spin motor at platform level. Both units were hand-carried
to the subcontractor for failure analysis. The subcontractor was 
familiar with the construction of the platform and had the 
facilities to disassemble the platform without disturbing the 
apparently intermittent failure condition. 

Verification test conclusions.— Verification tests isolated
the site LSPT failures to a failure of the platform spin motor to 
spin up, thereby causing major alarms at the inverter null or 
caging null gate. During testing, three of the first four failed 
platforms caged upon repeated application of voltage. Once the 
platform caged, the platform, autopilot, and G&C met all 
system test requirements. On the basis of these results, it was 
decided to repeat LSPTs up to 10 times after a site failure 
before removing the G&C. If the LSPT’s were successful, the 
G&C would be left on line. 

Measurements at platform level indicated the problem was 
internal to the platform and that all resistances and the platform 
temperature were correct. Subcontractor representatives 
reviewed the test results and concurred that the problem was 
internal to the platform. 

Mechanical Tests 

The spin motor breakaway torque was measured with a
gram gage on platform 127 and was found to be normal
(750 dyne cm). Dynamometer tests were performed on both
platforms. The dynamometer is an instrument that measures
rotation torque by slowly rotating the rotor of the spin motor
while recording the stator rotational torque. The dynamometer
is used during initial builds to establish the spin motor bearing
preload (torque). The spin motor generates approximately
4000 dyne cm of starting torque with normal excitation voltage;
800 dyne cm of this torque is used to overcome the inertia and
frictional torque of the motor.

Figure 4-8.— Platform dynamometer torque test.

Platform 140 was tested on the dynamometer and produced the
torque peaks of 3400 and 3100 dyne cm shown in figure 4-8.
The torque peaks were three revolutions apart. This is four 
times the normal running torque level for a new spin motor and 
about four times the torque level for this spin motor for the rest 
of its run. The torque increase lasted for about one-half a 
revolution and repeated within three revolutions. The spin motor 
bearings were cleaned and reassembled. Two large torque 
spikes of approximately 3000 dyne cm were observed on the 
first revolution. A 2200-dyne cm torque hump, one revolution 
in duration, was centered at the beginning of the second 
revolution. From these results, it was concluded that something 
in the spin motor bearing was causing an abnormal frictional 
load there. This result isolated the problem to the spin motor 
bearing area and eliminated the motor electrical characteristics 
as a contributor. 


Runup and Rundown Tests 

A series of tests were performed on spin motors 96 and 140 
to determine the effect of motor running time on spin motor 
start and running torque. Figure 4-9 shows the change in
rundown time with a change in motor run time. 


Summary of Case Study 

Field problem cause.— The 26 LSPT failures at the site were 
caused by the failure of the G&C platform spin motors to spin 
up within 6 sec after the command to ready the vehicle for 
launch. It was determined that the spin motors did not start with 
the normal application of voltage. A polymer film had formed 
on the bearing surfaces during testing at 175 °F and caused the 
balls to stick to the outer race. This film was identified as one 
from the alkyl phenol and alkyl benzene families, and its source 
was determined to be uncured resins from the bearing retainer. 



Figure 4-9.— Rundown time versus motor run time.

Polymer film.— A film approximately 900 Å thick had
formed on the metal surfaces of the bearings of failed spin
motors. The amount of material generated was ~10⁻⁷ g/ball. To
put this number in proper perspective, 2x10*^ g of oil is put on 
the bearing race during initial build, and 2×10⁻³ g of oil is
impregnated in the bearing retainer. 

Alkyl phenol/alkyl benzene is a generic identification of a
family of organic compounds. Further analysis identifies the 
major compounds in the family as phenol and methyl phenol 
(alkyl phenols) and toluene, xylene, and benzene (alkyl ben- 
zenes). A phenolic polymeric film would have the gummy, 
adhesive, and insolubility properties detected in the analysis. 
There is little doubt that the gummy film detected was a
phenol-based material.

Source of phenol. — Phenols are used in three areas of the 
spin motor. A phenolic adhesive bonds the stator laminations 
together and bonds the hysteresis ring to the rotor. The bonding 
processes adequately cure the phenol to the point where uncured 
phenols would not be present. Also, the stator laminations are 
coated with epoxy after bonding. The remaining source is the 
paper phenolic retainer, which serves as a spacer and a lubrica- 
tion source for the spin motor bearings. Mass spectral analysis 
of the retainers yielded spectra essentially identical to the 
spectrum of the coating on the failed bearings. The conclusion 
of this analysis is that the source of the phenolic is uncured 
phenolic resins or resin compounds in the retainer. 

Retainer processing. — The retainer material is manufac- 
tured to military specifications by a vendor and is screened to 
tighter vehicle requirements for specific gravity. There is no 
specific requirement concerning uncured resins in the retainer 
material. The vendor estimated an upper limit of 1 percent of 
uncured resin in the retainer raw material. One percent would 
provide 3×10⁻⁵ g of uncured resins, more than sufficient to
cause the spin motor problem. 

The finished retainer material is cleaned by an extraction 
process with benzene or hexane. This process does not remove 
a significant amount of uncured resins. Therefore, if uncured 
resins survive the vendor processing, they will remain in the 
uncured state in the installed retainers. 

Mechanism of film formation — It is theorized that the 
uncured resins are transferred from the retainer to the bearing 
surfaces through the natural lubricating process of the retainer. 
Running the spin motors generates centrifugal forces that sling 
the excess oil off the rotating surfaces, leaving a thin film of oil. 
The force of gravity during subsequent storage of the motor 
causes the already thin film to become thinner on the top 
surfaces and thicker on the lower surfaces. This redistribution 
process involves only the oil and leaves more viscous contami- 
nants in place. Subsequent running of the motor will cause 
replacement of oil on the oil-free surfaces. The source of the 
replacement oil is the retainer capillaries. This replacement 
process will cause the oil to bring any uncured phenolics to the 


surface of the retainer. The metal surfaces will then become 
lubricated with oil containing a small percentage of uncured 
resins. Subsequent storage cycles and running will continue 
this redistribution process, steadily increasing the phenolic 
concentration. Exposure to a temperature of 175 °F and 
extended operational maintenance gradually cure these 
phenolics in two stages. Initially, a highly viscous gummy 
residue is formed; finally, a hard, insoluble polymer film is 
formed on the metal surfaces. The film forms a bond between 
the balls and the races. The coating builds up to the point where 
the spin motor torque cannot overcome the bond at the initial 
power application. 

Extent of problem.— An analysis of failed and unfailed field 
units proved that not all platforms are susceptible to this failure. 
Obviously, a high percentage are susceptible, since 26 failures 
have been experienced. It is likely that many unfailed platforms 
contain some small percentage of uncured resins. 

The significantly higher failure rate in the units with higher 
serial numbers points to a process (or common) failure mode. 
All evidence points to lot-to-lot variations in the amount of 
uncured resins present in the retainer raw material. Traceability 
from retainer lot to individual platform spin motor was not 
possible in this case, but such records should be available. The 
26 units that have failed and the failure rate at the 14-day 
interval bound the total platform failure rate. The number of 
spares available is adequate to meet system life and reliability 
requirements. 

Site reliability.— The site system reliability goal allows 
approximately two G&C failures per month for any cause. 
Analysis of test data indicates the goal can be achieved at either
a 7-day test interval (0.8 failure/month) or a 14-day test interval
(1.5 failures/month). It cannot be achieved at a 21-day interval
(7.7 failures/month) or a 28-day interval (8.6 failures/month).
Even though at least 74 percent of the site failures were 
restarted, a limited number of spare G&Cs are available. 
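
The comparison against the site goal can be tabulated in a few
lines. The sketch below uses only the failure rates quoted above.
The names GOAL_PER_MONTH and meets_goal are illustrative, and the
goal is treated as exactly two failures per month even though the
text says "approximately."

```python
# Failure rates per month quoted in the text for each LSPT interval (days).
FAILURES_PER_MONTH = {7: 0.8, 14: 1.5, 21: 7.7, 28: 8.6}

# Assumption: the "approximately two G&C failures per month" goal is
# treated as a hard limit of 2.0 for this illustration.
GOAL_PER_MONTH = 2.0


def meets_goal(interval_days: int) -> bool:
    """Return True if the quoted failure rate is within the goal."""
    return FAILURES_PER_MONTH[interval_days] <= GOAL_PER_MONTH


acceptable = [d for d in sorted(FAILURES_PER_MONTH) if meets_goal(d)]
print(acceptable)  # [7, 14]
```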

Tests at the site revealed that most failed spin motors can be 
restarted within 10 power applications and once started will 
perform properly. The site procedure was revised to leave any 
failed G&Cs that restart within 10 attempts on line. Platforms 
that did not start within 10 attempts were returned to the 
contractor and were restarted by repetitive application of 
overvoltage or reverse voltage up to the motor saturation limit. 
These data support the conclusion that the failure mode was the 
formation of a film bond on the race and that increasing the 
inverter output voltage to the motor saturation limit would not 
eliminate the problem. 

Current site operating procedures provide a 14-day LSPT 
interval with a 10-min run time. This enables the G&C failure 
rate to meet system reliability goals. The vehicle site is cur- 
rently being deactivated. If reactivation should be required, the 
repair of all defective or suspect platforms should be included
as part of that effort.

Concluding Remarks

Now that you have completed chapter 4, several concepts 
should be clear. 

(1) The failure rate of complex equipment is usually considered
to be a constant.

(2) Most failures are random, with repetitive failures repre- 
senting a small portion of unreliability. 

(3) The rate at which failures occur depends upon 

(a) The acceptance criteria, which determine 
how effectively potential failures are detected 

(b) All applied stresses, including electrical, mechani- 
cal, and environmental. (As these stresses increase, 
the failure rate usually increases.) 

(4) Published failure rate data represent the potential 
failures expected of a part. The rate at which these 
failures are observed depends on the applied electrical 
stresses (the stress ratio) and the mechanical stresses 
(the K_A factor).

(5) In general, failure rate predictions are best applied on a 
relative basis. 

(6) Failure rate data can be used to provide reliability 
criteria to be traded off with other performance para- 
meters or physical configurations. 

(7) The reliability of a device can be increased only if the 
device’s failure mechanisms and their activation causes 
are understood. 

In addition, you should be able to use failure rate data to 
predict the failure rate expected of a design, and consequently, 
to calculate the first term, P_c, of inherent reliability. Finally,
you should be able to allocate failure rate requirements to parts 
after having been given a reliability goal for a system or the 
elements of a system. 
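
The three calculations named above (prediction, reliability, and
allocation) can be sketched as follows. This is a minimal
illustration, not the manual's own procedure: it assumes that the
K_A factor simply multiplies the summed part failure rates, as
exercise 1a below implies, and that parts are in series with
constant failure rates. The function names and the parts list are
hypothetical and are not taken from table 4-4.

```python
import math


def predicted_failure_rate(parts, k_a=1.0):
    """Sum N_i * lambda_i over all parts and scale by the K_A factor.

    parts: iterable of (count, failures_per_hour) pairs.
    """
    return k_a * sum(n * lam for n, lam in parts)


def reliability(failure_rate, hours):
    """P_c = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate * hours)


def allocated_part_failure_rate(goal, n_parts, hours):
    """Average part failure rate that meets a reliability goal.

    Solves goal = exp(-n_parts * lambda * hours) for lambda.
    """
    return -math.log(goal) / (n_parts * hours)


# Hypothetical parts list: (count, failures per hour).
parts = [(10, 2e-9), (4, 10e-9), (1, 50e-9)]
lam = predicted_failure_rate(parts, k_a=1000)  # about 1.1e-4 failures/hr
print(lam, reliability(lam, 0.5))
print(allocated_part_failure_rate(0.99, 200, 100))
```

Working on a relative basis, as remark (5) suggests, would mean
comparing `lam` for two candidate designs rather than reading
either value as an absolute prediction.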
Reliability Training¹

1a. Using the failure rate data in table 4-4, calculate the flight failure rate for a launch vehicle electronic subsystem consisting
of the following parts (assume K_A = 1000):

    Component                               Number of parts, N
    Resistor, G657109/10                    5
    Resistor, variable, 11176416            1
    Capacitor, G657113                      3
    Diode, G6557092                         3
    Transistor, 11176056                    4
    Integrated circuit, analog, 11177686    1

    A. 195 failures per 10⁹ hr    B. 195 000 failures per 10⁹ hr    C. 195 000 failures per 10⁶ hr

1b. Assume the flight failure rate for this circuit is 500 000 failures per 10⁹ hr. Calculate the reliability of the circuit for a 0.01-hr
flight.

    A. 0.9995    B. 0.99995    C. 0.999995

2. The a posteriori flight failure rate of a launch vehicle is 440 000 failures per 10⁹ hr.

a. If the storage failure rate is 0.3 of the operating rate, how long can the vehicle be stored with a 90.4 percent probability of no
failures?

    A. 30 days    B. 40 days    C. 50 days

b. After 1450 hr (2 months) in storage the vehicle is removed and checked out electronically. If the vehicle passes its electronic
checkout and the checkout equipment can detect only 80 percent of the possible failures, what is the probability that the vehicle
is good? (Ignore test time.)

    A. 0.962    B. 0.858    C. 0.946

3. A subassembly in a piece of ground support equipment has a reliability requirement of 0.995. Preliminary estimates suggest that
the subassembly will contain 300 parts and will operate for 200 hr. What is the average part failure rate required to meet the
reliability goal?

    A. 25×10⁻⁶    B. 16 667×10⁻⁹    C. 83×10⁻⁹

4. A piece of ground support equipment has a reliability goal of 0.9936. It contains four subassemblies of approximately equal risk.

a. What is the allocated reliability goal of each of the four subassemblies?

    A. 0.99984    B. 0.9984    C. 0.9884

b. Allocating further into subassembly 1, assume the goal is 0.998. Solve for the average part failure rate given the following:

    Estimated parts count: 100
    Estimated operating time: 10 hr

    A. 20 000×10⁻⁹    B. 2000×10⁻⁹    C. 200×10⁻⁹


¹Answers are given at the end of this manual.
TABLE 4-4.—SELECTED LISTING—APPROVED ELECTRONIC
FAILURE RATES FOR LAUNCH VEHICLE APPLICATIONᵃ

Part number               Part                        Failure rate, failures/10⁹ hr
                                                      Operating   Nonoperating
                                                      modeᵇ       modeᵇ

Integrated circuits
11177680/81/82/83/84/85   Digital                     10          3
11177686                  Analog                      30          10


Transistors 

i — — i 

f 


6557155 

6557318/19 

6557046 

11176911 

11176056 

11177685 

6310038 

6557072 


Double switch 
Medium-power switch 
PNP type transistor 
Medium-power switch 
High-speed switch 
Field-effect transistor 
2N5201 

2N918 (unmatched) 


10 

20 

I 


10 

50 


Diodes 


6557061 

6557092 

6557123 

6557125 

11176912 


Rectifier and logic (5 V) 
Rectifier and logic (30 V) 
Rectifier and logic (50 V) 
Rectifier and logic (600 V) 
Rectifier and logic (400 V) 


20 

5 


Resistors 


6557018 

6557015 

6557016/17 

6557030 

6557031 

6557109/10 

6557329 

11176416 

2.5-W wirewound 
1/8- W wi rewound 
1- and 2-W wirewound 
1/10-W fixed film 
6-W wirewound 
1/4- W fixed composition 
1/8-W fixed film 
1-W variable metal film 

2 

3 

2 

1 

5 

1 

1 

50 

1 

2 

5 

5 

5 

2 

3 

10.3 

Capacitors
G657020/21/22             Fixed glass                 0.1         0.1
G657113/173               Fixed ceramic               5           1
G657114                   Fixed ceramic               10          1
G657119/120               Solid tantalum              2           1
G657202                   Precision, fixed ceramic    50          3

Relays
11176326/453              DPDT armature               100         20

Transformers (RF)
11301034/35/43/49                                     10          5
11301064                                              1           5

RF coil
G657140/41                                            3           2
G657178/81                                            10          2

RF filter
G657189                                               150         5


V-unreni iaiiure rate uaia arc avuuauic iium ^ 

ᵇApplies to all slash numbers of parts shown (worst case shown).

Chapter 5

Applying Probability Density Functions 


The inherent reliability of equipment is defined in chapter 3
as

$$R_i = e^{-\lambda t} P_t P_w$$

where

    R_i        probability of no failures
    e^(-λt)    probability of no catastrophic part failures
    P_t        probability of no tolerance failures
    P_w        probability of no wearout failures


Before discussing the P_t and P_w terms in the next chapter, it
is necessary to understand probability density functions and 
cumulative probability functions. These concepts form another 
part of probability theory not discussed in chapter 2. First, in 
this chapter, the theory of density and cumulative functions is 
discussed in general; then the normal, or Gaussian, distribution 
is discussed in detail. This normal distribution is used exten- 
sively later in the manual. 

Probability Density Functions 

If a chance variable x can take on values only within some 
interval, say between a and b, the probability density function 
p(x) of that variable has the property that (ref. 5-1)

$$\int_a^b p(x)\,dx = 1$$

In other words, the area under the curve p(x) is equal to unity.
This is shown in figure 5-1. 


In the language of probability, the probability of x being 
within the interval (a,b) is given by 

$$P(a < x < b) = \int_a^b p(x)\,dx = 1$$

In other words, the probability that x lies between a and b is 1.
This should be clear, since x can take only values between a and b.

In a similar fashion, we can find the probability of x being 
within any other interval, say between c and d, from 

$$P(c < x < d) = \int_c^d p(x)\,dx$$

which is shown in figure 5-2.

Example 1: Suppose we were to perform an experiment in 
which we measured the height of oak trees in a 1-acre woods. 
The result, if our measuring accuracy is ±5 ft, might look like 
the histogram shown in figure 5-3. 

The value at the top of each histogram cell (or bar) indicates 
the number of trees observed to have a height within the 
boundaries of that cell. For example, 19 trees had a height 
between 0 and 10 feet, 17 trees had a height between 10 and 20
feet, and so on. The figure shows that 100 trees were observed. 

Now let us calculate values for the ordinate of the histogram
so that the area under the histogram equals unity. Then, we will
establish a probability density function for the tree heights.
Since we observed 100 trees, it should be apparent that if the
calculated ordinate of a cell times the width of the cell (the
cell area) yields the percentage of 100 trees in that cell, the sum
of the percentages in all cells will have to equal 100 percent. Or,
if the percentages are expressed as decimal fractions, their sum
will equal 1, which will be the total area under the histogram.
Therefore,

$$\text{Ordinate of cell} = \frac{\text{Percent of trees in cell}}{\text{Width of cell}}$$

For the cell 0 to 10 feet, which has 19 percent of the trees in it,

$$\text{Ordinate of cell} = \frac{19}{100} \times \frac{1}{10} = 0.019$$

As a check, we can see that

$$\text{Cell area} = \text{Ordinate of cell } (0.019) \times \text{Cell width } (10) = 0.19, \text{ or 19 percent}$$

In a similar fashion, the ordinates for the other cells can be
calculated and are shown in table 5-1 and figure 5-4.
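
The same normalization can be carried out for every cell at once.
The sketch below is only an illustration: the tree counts are the
ones listed for the ten 10-ft cells in table 5-1, and the variable
names are arbitrary.

```python
# Tree counts per 10-ft cell (0-10 ft, 10-20 ft, ..., 90-100 ft),
# as listed in table 5-1.
counts = [19, 17, 15, 13, 11, 9, 7, 5, 3, 1]
cell_width = 10.0  # ft

total = sum(counts)  # 100 trees observed

# Ordinate = (fraction of trees in the cell) / (cell width).
ordinates = [c / total / cell_width for c in counts]

# Cell area = ordinate * cell width; the areas must sum to unity.
areas = [o * cell_width for o in ordinates]

print(round(ordinates[0], 3))  # 0.019
print(round(sum(areas), 6))    # 1.0
```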

The next step (fig. 5-4) is to draw a line through the midpoint
of the cells. The equation for this line is called the probability
density function p(x) and has the form

    p(x) = -0.0002x + 0.02

Figure 5-1.— Probability density function curve. (Curve labeled
"Equation of curve p(x).")

Figure 5-2.— Application of probability density function. (Area
under p(x) between x = c and x = d is probability that x lies
between c and d.)

Figure 5-3.— Height of trees observed in 1-acre woods.



The area under the curve is (ref. 5-2)

$$\begin{aligned}
\text{Area} &= \int_0^{100} p(x)\,dx = \int_0^{100} (-0.0002x + 0.02)\,dx \\
&= \left[-\frac{x^2}{10^4} + 0.02x\right]_0^{100} = -\frac{(100)^2}{10^4} + 0.02(100) \\
&= -\frac{10^4}{10^4} + 2 = -1 + 2 = 1
\end{aligned}$$

This agrees with our requirement that the area under a probability
density function equal unity.
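
The unit-area requirement can also be checked numerically. The
sketch below is an illustration only; it uses a simple midpoint
rule (an assumption of this example, not a method from the manual),
which is exact for a straight-line density apart from rounding.

```python
def p(x):
    """Linear density fitted to the tree data, valid for 0 <= x <= 100 ft."""
    return -0.0002 * x + 0.02


def integrate(f, lo, hi, steps=1000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h


area = integrate(p, 0.0, 100.0)
print(round(area, 6))  # 1.0
```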


TABLE 5-1.—CALCULATION OF CELL
ORDINATES FOR TREE DATA

Cell       Ordinate                   Area, cell width
                                      times cell ordinate
0-10       19/(100×10) = 0.019        0.19
10-20      17/10³ = 0.017              .17
20-30      15/10³ = 0.015              .15
30-40      13/10³ = 0.013              .13
40-50      11/10³ = 0.011              .11
50-60      9/10³ = 0.009               .09
60-70      7/10³ = 0.007               .07
70-80      5/10³ = 0.005               .05
80-90      3/10³ = 0.003               .03
90-100     1/10³ = 0.001               .01

Total area                            1.00

Figure 5-4.— Probability density function for tree heights.


Application of Density Functions

Now let us see how we can apply the density function to the
tree data. To find the percentage of trees between 60 and 80 feet
high, solve for

$$\begin{aligned}
P(60 < x < 80) &= \int_{60}^{80} p(x)\,dx = \int_{60}^{80} (-0.0002x + 0.02)\,dx \\
&= \left[-\frac{x^2}{10^4} + 0.02x\right]_{60}^{80} = -\frac{1}{10^4}\left(80^2 - 60^2\right) + 0.02(80 - 60) \\
&= -\frac{1}{10^4}(2800) + 0.4 = -0.28 + 0.4 \\
&= 0.12, \text{ or 12 percent}
\end{aligned}$$

Figure 5-3 shows that this answer is correct, since 12/100 trees
were observed to have a height between 60 and 80 feet.

Another way to look at this example is that there is only a
12-percent chance that a tree picked at random from the 1-acre
area would have a height between 60 and 80 feet. In a similar
fashion, we can calculate the probability that a tree would have
any range of heights within the boundary of 0 to 100 feet.

In the tree example, we were able to measure the trees in a
particular part of the woods and to obtain a height density
function for those trees. But what do we do if we are interested
in a different area of woods and for some reason we are not able
to go out and measure the trees? We would probably assume
that the acre we measured was representative of all other acres
in the same woods. If we accept this assumption, we could then
use our experience (the established density function) to predict
the distribution of tree heights in an unmeasured acre. And this
is exactly what is done in industry.

As you can see, if we know what the density functions are for
such things as failure rates, operating temperatures, and missile
accuracy, it is easy to determine the probability of meeting a
failure rate requirement for equipment (such as a missile)
specified to operate in some temperature range with a required
accuracy.

Example 2: Suppose that a missile has a maximum target
miss distance requirement of 90 feet and that after several
hundred firings, the probability density function for miss
distance is

    p(x) = -0.0002x + 0.02 where 0 < x < 100

which is the same as the p(x) for the tree example and is shown
in figure 5-5.

Figure 5-5.— Probability density function for missile target
miss distance.

To predict the probability that the next missile fired will miss
the target by more than 90 feet, solve for

$$\begin{aligned}
P(90 < x < 100) &= \int_{90}^{100} (-0.0002x + 0.02)\,dx = \left[-\frac{x^2}{10^4} + 0.02x\right]_{90}^{100} \\
&= -\frac{1}{10^4}\left(100^2 - 90^2\right) + 0.02(100 - 90) \\
&= -\frac{1900}{10^4} + 0.02(10) \\
&= -0.19 + 0.2 = 0.01, \text{ or 1 percent}
\end{aligned}$$

In other words, there is a 99-percent chance that the missile
will hit within 90 feet of the target and a 1-percent chance that
it will not. This is shown as the shaded area under the density
function in figure 5-5.
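
Both interval probabilities in this section follow the same
pattern: integrate p(x), evaluate the result at the two limits, and
subtract. A minimal sketch is given below; the function names are
illustrative only.

```python
def integrated_density(x):
    """Integral of p(x) = -0.0002x + 0.02 from 0 to x, for 0 <= x <= 100."""
    return -x**2 / 1e4 + 0.02 * x


def prob_between(lo, hi):
    """Area under p(x) between lo and hi."""
    return integrated_density(hi) - integrated_density(lo)


# Trees between 60 and 80 ft tall.
print(round(prob_between(60, 80), 4))   # 0.12
# Missile miss distance greater than 90 ft (example 2).
print(round(prob_between(90, 100), 4))  # 0.01
```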

Cumulative Probability Distribution

Another practical tool in probability calculation is the cumu- 
lative probability distribution F(x) from reference 5-3. An F(x) 
curve for the tree example in the preceding section is shown in 
figure 5-6. The curve represents the cumulative area under the 
probability density function p(x). The ordinates of the curve
were calculated as shown in table 5-2. 

The cumulative curve can be used to solve the same problems 
that the density curve was used to solve. 

Example 3: Referring again to example 1, suppose that we 
want to know the probability that a particular tree selected at 
random from the woods will have a height between 30 and 
50 feet. 

Solution 3A: Using the density function for tree height,

$$\begin{aligned}
P(30 < x < 50) &= \int_{30}^{50} (-0.0002x + 0.02)\,dx = \left[-\frac{x^2}{10^4} + 0.02x\right]_{30}^{50} \\
&= -\frac{1600}{10^4} + 0.40 \\
&= -0.16 + 0.40 = 0.24, \text{ or 24 percent}
\end{aligned}$$

TABLE 5-2.—ORDINATES FOR CUMULATIVE
DISTRIBUTION OF TREE DATA

Tree height,   Area under      Ordinate of F(x) curve
ft             p(x) curve      (cumulative area)
0-10           0.19            0.19
10-20           .17             .36
20-30           .15             .51
30-40           .13             .64
40-50           .11             .75
50-60           .09             .84
60-70           .07             .91
70-80           .05             .96
80-90           .03             .99
90-100          .01            1.00



Solution 3B: Using the cumulative curve shown in figure 5-6,

    P(30 < x < 50) = F(50) - F(30) = 0.75 - 0.51
                   = 0.24, or 24 percent

which agrees with solution 3A.

Figure 5-6.— Cumulative probability function for tree heights.

Note that in working out solution 3A, evaluating the integrated
expression at the limits 50 and 30 gives 0.75 - 0.51, which is the
same as the next-to-last step of solution 3B. The reason for this
is that the equation of the cumulative probability function F(x)
is found from

$$F(x) = \int p(x)\,dx$$

so that

$$\int_a^b p(x)\,dx = F(b) - F(a)$$

For the tree example

$$F(x) = \int (-0.0002x + 0.02)\,dx = -\frac{x^2}{10^4} + 0.02x$$

Consequently, we can find the probability of a variable x being
within some interval by using the cumulative function F(x)
even though the cumulative graph is not available.
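
The agreement between the tabulated cumulative areas and the
equation for F(x) can be checked directly. The sketch below is an
illustration only; it rebuilds the cumulative column of table 5-2
as a running sum of the cell areas and compares it with the
closed-form F(x) at each upper cell boundary.

```python
from itertools import accumulate

# Cell areas from table 5-2 (0-10 ft, 10-20 ft, ..., 90-100 ft).
cell_areas = [0.19, 0.17, 0.15, 0.13, 0.11, 0.09, 0.07, 0.05, 0.03, 0.01]
upper_edges = range(10, 101, 10)

# Running sum of cell areas gives the tabulated F at each upper edge.
F_table = dict(zip(upper_edges, accumulate(cell_areas)))


def F(x):
    """Closed-form cumulative function for the tree example."""
    return -x**2 / 1e4 + 0.02 * x


# The table and the equation agree at every cell boundary.
for edge in upper_edges:
    assert abs(F_table[edge] - F(edge)) < 1e-9

print(round(F_table[50] - F_table[30], 2))  # 0.24
```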

Example 4: What is the probability that a tree selected at 
random will have a height less than 20 feet? 

Solution 4 : 


$$\begin{aligned}
P(0 < x < 20) &= \int_0^{20} p(x)\,dx = F(20) - F(0) \\
&= \left[-\frac{x^2}{10^4} + 0.02x\right]_0^{20} = -\frac{20^2}{10^4} + 0.02(20) - 0 \\
&= -0.04 + 0.4 = 0.36, \text{ or 36 percent}
\end{aligned}$$


which agrees with a graphical solution. 

Figure 5-7.— Cumulative distribution of tropic zone temperatures.

Some general rules for the use of the cumulative function
F(x) are

(1) P(x < a) = F(a)

(2) P(x > a) = 1 - F(a)

(3) P(a < x < b) = F(b) - F(a)
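
The three rules translate directly into small helper functions.
The sketch below applies them to the tree-height cumulative
function; the helper names are illustrative, and clamping F(x) to 0
below 0 ft and to 1 above 100 ft is an assumption made so that the
helpers behave sensibly outside the measured range.

```python
def F(x):
    """Cumulative function for the tree example, clamped to [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 100:
        return 1.0
    return -x**2 / 1e4 + 0.02 * x


def prob_below(a):
    """Rule (1): P(x < a) = F(a)."""
    return F(a)


def prob_above(a):
    """Rule (2): P(x > a) = 1 - F(a)."""
    return 1.0 - F(a)


def prob_between(a, b):
    """Rule (3): P(a < x < b) = F(b) - F(a)."""
    return F(b) - F(a)


print(round(prob_below(20), 2))        # 0.36
print(round(prob_above(90), 2))        # 0.01
print(round(prob_between(30, 50), 2))  # 0.24
```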


Example 5: Suppose that we would like to know the probability
of equipment seeing tropic zone temperatures above 120 °F
during operation because at or above 120 °F, we have to add a
costly air-conditioning system to cool the equipment. If we
could obtain the temperature data, we might find that the
cumulative distribution for tropic zone temperatures would be
that shown in figure 5-7.

Solution 5: From the curve, the probability of observing a
temperature at or above 120 °F is given by

    P(temp ≥ 120 °F) = 1 - F(120 °F) = 1 - 0.97
                     = 0.03, or 3 percent

With only a 3-percent chance of temperatures above 120 °F, we
probably would decide against air conditioning (all other
parameters, such as failure rate, being equal).


Normal Distribution

One of the most frequently used density functions in reliability
engineering is the normal, or Gaussian, distribution. A more
descriptive term, however, is the normal curve of error because
it represents the distribution of errors observed from repeated
measurements of an object or some physical phenomenon
(ref. 5-4).

Example 6: Assume that we need to measure the heights of
eighth-grade children. A histogram of the children's heights
would resemble the curve in figure 5-8. If, as in our tree example,
we calculate an ordinate for the histogram so that the area under
the histogram equals unity and then connect the midpoints of
each cell, we obtain a smooth curve as shown in figure 5-8. This
curve represents the density function for the heights of the
children. Such a curve (sometimes called a bell curve) is the
shape of the normal distribution. We say that the children's
heights are distributed normally.

Figure 5-8.— Histogram and density function for heights
of children. (Abscissa: height, ft.)


Normal Density Function 


The equation for the density function p(x) of the normal
distribution is

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \bar{x})^2 / 2\sigma^2}$$

This curve is shown in figure 5-9. The function p(x) has two
parameters. The first is the mean x̄ calculated from

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where

    n      total number of measurements or observations
    x_i    value of ith measurement

The mean, therefore, is the arithmetic average of the
measurements. From example 6, we would add all the heights observed
and then divide by the number of children measured to obtain
a mean or average height. The mean of all the children's heights
from the data in figure 5-8 is 5.3 ft.
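
The mean and the density function are simple to evaluate. In the
sketch below, the list of heights is hypothetical (chosen only so
that its average is 5.3 ft), the function names are illustrative,
and the standard deviation passed to the density function is
treated as a given input.

```python
import math


def mean(values):
    """Arithmetic average of the measurements."""
    return sum(values) / len(values)


def normal_pdf(x, mu, sigma):
    """Normal density p(x) with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Hypothetical sample of children's heights, ft.
heights = [5.0, 5.2, 5.3, 5.4, 5.6]
mu = mean(heights)
print(round(mu, 2))  # 5.3

# The density is symmetrical about the mean and peaks there.
print(normal_pdf(mu, mu, 0.3) > normal_pdf(mu + 0.3, mu, 0.3))  # True
```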
Figure 5-9.— Normal density function. (Abscissa: standardized
normal variable.)

The second parameter of p(x) is the standard deviation σ
calculated from

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

where

    x̄      mean of measurements
    x_i    value of ith measurement
    n      total number of measurements

Note that n - 1 is used in the equation to give an unbiased
sampling distribution. In the general definition of σ, n instead
of n - 1 would be used.

The standard deviation is the square root of the variance,
which is denoted by σ². The magnitude of the variance, as well
as the standard deviation, indicates how far all the
measurements deviate from the mean. The standard deviation of the
children’s height data, for example, is approximately 0.3 ft. If
the range of heights observed had been from 5 to 5.6 ft, the
standard deviation would have been approximately 0.1 ft; with
this standard deviation, the distribution would look squeezed
together, as shown by the dashed curve in figure 5-8. However,
the area under the dashed curve would still equal the area under
the solid curve.


Properties of Normal Distribution

The normal density function is a continuous distribution
from -∞ to ∞. It is symmetrical about the mean and has an area
equal to unity as required for probability density functions. For
the normal distribution, the standard deviation is the distance
on the abscissa from the mean x̄ to the intercept on the abscissa
of a line drawn perpendicular to the abscissa through the point
of inflection on the curve. This is shown in figure 5-9. It is also
shown that equal increments of the standard deviation can be
laid out to the left (-) and the right (+) of the mean x̄.

As you will recall, in determining probabilities from a
density function, we need to calculate the area under the curve
p(x). When using the normal density function, it is common
practice to relate areas to the standard deviation. In general,
the area under the curve between the values of z and -z
standard deviations can be found from

$$P(-z < x < z) = \frac{1}{\sqrt{2\pi}} \int_{-z}^{z} e^{-x^2/2}\,dx$$

where x is measured from the mean in units of the standard
deviation.

The areas for various values of z are shown in table 5-3. This
table shows that the area under the normal curve between 1σ
and -1σ is 0.683, or 68.3 percent; the area under the normal
curve between 2σ and -2σ is 0.9545, or 95.45 percent, and so
forth.

Example 7: The term “3σ limit” refers to the area under the 
normal curve between 3σ and -3σ, which is 0.9973, or 
99.73 percent, as shown in table 5-3. Therefore, if a power 
supply output is defined as 28 ± 3 V and the ±3 V represents a 
3σ limit, 99.73 percent of all such power supplies will have 
an output between 25 and 31 V. The percentage of supplies 
having an output greater than 31 V or less than 25 V will be 
1 - 0.9973 = 0.0027, or 0.27 percent, as shown in figure 5-10. 
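
The areas quoted for 1σ, 2σ, and 3σ limits can be checked numerically. The following is a minimal sketch, assuming Python's standard `math.erf`; the helper name `area_within` is illustrative and does not come from the text.

```python
import math


def area_within(z):
    """Area under the standard normal curve between -z and +z."""
    # For a standard normal variable, P(-z < Z < z) = erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2.0))


for z in (1, 2, 3):
    print(f"z = {z}: area between -z and +z = {area_within(z):.4f}")

# Fraction of power supplies expected outside the 3-sigma limits (25 V to 31 V)
outside = 1.0 - area_within(3)
print(f"fraction outside the 3-sigma limits = {outside:.4f}")
```

The printed areas should agree with the 0.683, 0.9545, and 0.9973 values cited from table 5-3 to the number of places given there.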

Up to now we have been working with areas under the 
normal density function between integers of σ, that is, 1, 2, 3, 
and so on. In practice, however, we are usually interested in the 
area between decimal fractions of σ, those being 1.1, 2.3, 
et cetera. We have also been using z to represent the number of 
standard deviations that a particular limit value is from the 
mean. For instance, in the power supply example, 25 V was 
given as being three standard deviations from the mean of 28 
V. It is better when working in decimal fractions of σ to let 
z = (x - x̄)/σ where x - x̄ is the distance from the mean x̄ to 
the limit value and σ is the standard deviation. Going back to 
the supply example, our lower limit was 25 V, which was 3 V 
from the mean of 28 V, and the standard deviation was 1 V; 
therefore, z = (25 - 28)/1 = -3. 
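
As a small illustration of the z definition, the sketch below converts the power supply limits into standard deviations from the mean. The function name `z_score` is an assumption made for this example only.

```python
def z_score(limit, mean, sigma):
    """Number of standard deviations from the mean to a limit value."""
    return (limit - mean) / sigma


# Power supply example: mean 28 V, standard deviation 1 V
z_lower = z_score(25.0, 28.0, 1.0)
z_upper = z_score(31.0, 28.0, 1.0)
print(f"lower limit: z = {z_lower}, upper limit: z = {z_upper}")
```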


TABLE 5-4. - AREAS IN TWO TAILS OF NORMAL CURVE AT SELECTED VALUES OF z 

[From reference 5-1.] 
[A superscript gives the number of zeros following the decimal point: 0.0³967 = 0.000967.] 

z      0       0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0      1.0000  0.9920  0.9840  0.9761  0.9681  0.9601  0.9522  0.9442  0.9362  0.9283
.1     .9203   .9124   .9045   .8966   .8887   .8808   .8729   .8650   .8572   .8493
.2     .8415   .8337   .8259   .8181   .8103   .8026   .7949   .7872   .7795   .7718
.3     .7642   .7566   .7490   .7414   .7339   .7263   .7188   .7114   .7039   .6965
.4     .6892   .6818   .6745   .6672   .6599   .6527   .6455   .6384   .6312   .6241
.5     .6171   .6101   .6031   .5961   .5892   .5823   .5755   .5687   .5619   .5552
.6     .5485   .5419   .5353   .5287   .5222   .5157   .5093   .5029   .4965   .4902
.7     .4839   .4777   .4715   .4654   .4593   .4533   .4473   .4413   .4354   .4295
.8     .4237   .4179   .4122   .4065   .4009   .3953   .3898   .3843   .3789   .3735
.9     .3681   .3628   .3576   .3524   .3472   .3421   .3371   .3320   .3271   .3222
1.0    .3173   .3125   .3077   .3030   .2983   .2937   .2891   .2846   .2801   .2757
1.1    .2713   .2670   .2627   .2585   .2543   .2501   .2460   .2420   .2380   .2340
1.2    .2301   .2263   .2225   .2187   .2150   .2113   .2077   .2041   .2005   .1971
1.3    .1936   .1902   .1868   .1835   .1802   .1770   .1738   .1707   .1676   .1645
1.4    .1615   .1585   .1556   .1527   .1499   .1471   .1443   .1416   .1389   .1362
1.5    .1336   .1310   .1285   .1260   .1236   .1211   .1188   .1164   .1141   .1118
1.6    .1096   .1074   .1052   .1031   .1010   .0989   .0969   .0949   .0930   .0910
1.7    .0891   .0873   .0854   .0836   .0819   .0801   .0784   .0767   .0751   .0735
1.8    .0719   .0703   .0688   .0672   .0658   .0643   .0629   .0615   .0601   .0588
1.9    .0574   .0561   .0549   .0536   .0524   .0512   .0500   .0488   .0477   .0466
2.0    .0455   .0444   .0434   .0424   .0414   .0404   .0394   .0385   .0375   .0366
2.1    .0357   .0349   .0340   .0332   .0324   .0316   .0308   .0300   .0293   .0285
2.2    .0278   .0271   .0264   .0257   .0251   .0244   .0238   .0232   .0226   .0220
2.3    .0214   .0209   .0203   .0198   .0193   .0188   .0183   .0178   .0173   .0168
2.4    .0164   .0160   .0155   .0151   .0147   .0143   .0139   .0135   .0131   .0128
2.5    .0124   .0121   .0117   .0114   .0111   .0108   .0105   .0102   .00988  .00960
2.6    .00932  .00905  .00879  .00854  .00829  .00805  .00781  .00759  .00736  .00715
2.7    .00693  .00673  .00653  .00633  .00614  .00596  .00578  .00561  .00544  .00527
2.8    .00511  .00495  .00480  .00465  .00451  .00437  .00424  .00410  .00398  .00385
2.9    .00373  .00361  .00350  .00339  .00328  .00318  .00308  .00298  .00288  .00279

z      0        0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9
3      0.00270  0.00194  0.00137  0.0³967  0.0³674  0.0³465  0.0³318  0.0³216  0.0³145  0.0⁴962
4      .0⁴633   .0⁴413   .0⁴267   .0⁴171   .0⁴108   .0⁵680   .0⁵422   .0⁵260   .0⁵159   .0⁶958
5      .0⁶573   .0⁶340   .0⁶199   .0⁶116   .0⁷666   .0⁷380   .0⁷214   .0⁷120   .0⁸663   .0⁸364
6      .0⁸197   .0⁸106   .0⁹565   .0⁹298   .0⁹155   .0¹⁰803  .0¹⁰411  .0¹⁰208  .0¹⁰105  .0¹¹520

Symmetrical Two-Limit Problems 

In this discussion the term “symmetrical two-limit problems” 
refers to the area under the density function at equal 
values of z from both sides of the mean. The power supply 
example was this type, since we were concerned with the area 
between -3σ and 3σ from the mean x̄. To work these problems 
when z is a decimal fraction, we use tables of areas in 
the two tails of the normal curve. 

Table 5-4 shows tabulated areas in two tails of the normal 
curve for selected values of z from the mean x̄. For example, 
when z = 3.0, the table shows that 0.00270 of the total area 
lies in the two tails of the curve below -3σ and above 3σ. 
Because the curve is symmetrical, 0.00135 of the area will lie 
to the left of -3σ and 0.00135 to the right of 3σ. Note that 
this agrees with figure 5-10 for the power supply example. 

Figure 5-10. — Probability density functions for power supply 
outputs. 

Example 8 (using table 5-4): Suppose that a circuit design 
requires that the gain β of a transistor be no less than 30 and 
no greater than 180. The mean x̄ of the β density function of 
a particular transistor is 105 with a standard deviation of 32. 
What percentage of the transistors will have a β within the 
required limits? 

Solution 8: 

Step 1 — Solve for z. The distance from the mean to either limit is 

|x - x̄| = 105 - 30 = 180 - 105 = 75 

Since σ is given as 32, 

$$z = \frac{|x - \bar{x}|}{\sigma} = \frac{75}{32} = 2.34$$

Step 2 — From table 5-4, the area in the two tails when z = 2.34 
is 0.0193. Therefore, 0.00965 of the transistors 
will have a β below 30 and 0.00965 will have a β above 180. 

Step 3 — Now find P(30 < β < 180). Since 0.0193 of the 
transistors will have a β below 30 or above 180, then 1 - 0.0193 
must give the fraction that will lie between 30 and 180. 
This is 1 - 0.0193 = 0.9807, or 98.07 percent, as shown in 
figure 5-11. If we were to buy 100 000 of these transistors, we 
would expect 98 070 of them to have a β between 30 and 180. 
The remaining 1930 would not meet our β requirements. 

Figure 5-11. — Transistor gain. 
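
The three steps of solution 8 can be sketched in code. This is an illustrative check, assuming Python's standard `statistics.NormalDist` in place of the table lookup; the variable names are not from the text, and z is rounded to two places to mimic reading a table.

```python
from statistics import NormalDist

mean, sigma = 105.0, 32.0    # transistor gain: mean and standard deviation
lower, upper = 30.0, 180.0   # required gain limits, symmetric about the mean

# Step 1: distance from the mean to either limit in standard deviations,
# rounded to two places as it would be for a table lookup.
z = round((upper - mean) / sigma, 2)

# Step 2: area in the two tails below -z and above +z.
std_normal = NormalDist()    # mean 0, standard deviation 1
two_tails = 2.0 * (1.0 - std_normal.cdf(z))

# Step 3: fraction of transistors whose gain lies within the limits.
within = 1.0 - two_tails
print(f"z = {z}, two-tail area = {two_tails:.4f}, within limits = {within:.4f}")
```

Using the unrounded z = 75/32 would change the result slightly in the fourth decimal place, which is the usual difference between a table lookup and a direct calculation.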


One-Limit Problems 

In many applications, engineers are interested only in one- 
sided limits, an upper or lower limit, rather than a two-sided 
upper and lower limit. In these cases, they are interested in 
the area under one tail of the density function as shown in 
figure 5-12. Tabulated values of the area in one tail of the 
normal density function at selected values of z are given in 
table 5-5. 

Example 9: Suppose an exploding bridgewire (EBW) 
power supply is required to produce an output voltage of at least 
1500 V. At this output voltage or greater, all the bridgewire 
detonators will explode. If the mean output of all such supplies 
is known to be 1575 V and the standard deviation is 46 V, what 
is the probability that an output of 1500 V or greater will be 
observed? 




Figure 5-12. — Example of one-limit problems, (a) Lower limit, 
(b) Upper limit. 
TABLE 5-5. - AREAS IN ONE TAIL OF NORMAL CURVE AT SELECTED VALUES OF z 

[From reference 5-1.] 
[A superscript gives the number of zeros following the decimal point: 0.0³968 = 0.000968.] 

z      0       0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0      0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641
.1     .4602   .4562   .4522   .4483   .4443   .4404   .4364   .4325   .4286   .4247
.2     .4207   .4168   .4129   .4090   .4052   .4013   .3974   .3936   .3897   .3859
.3     .3821   .3783   .3745   .3707   .3669   .3632   .3594   .3557   .3520   .3483
.4     .3446   .3409   .3372   .3336   .3300   .3264   .3228   .3192   .3156   .3121
.5     .3085   .3050   .3015   .2981   .2946   .2912   .2877   .2843   .2810   .2776
.6     .2743   .2709   .2676   .2643   .2611   .2578   .2546   .2514   .2483   .2451
.7     .2420   .2389   .2358   .2327   .2296   .2266   .2236   .2206   .2177   .2148
.8     .2119   .2090   .2061   .2033   .2005   .1977   .1949   .1922   .1894   .1867
.9     .1841   .1814   .1788   .1762   .1736   .1711   .1685   .1660   .1635   .1611
1.0    .1587   .1562   .1539   .1515   .1492   .1469   .1446   .1423   .1401   .1379
1.1    .1357   .1335   .1314   .1292   .1271   .1251   .1230   .1210   .1190   .1170
1.2    .1151   .1131   .1112   .1093   .1075   .1056   .1038   .1020   .1003   .0985
1.3    .0968   .0951   .0934   .0918   .0901   .0885   .0869   .0853   .0838   .0823
1.4    .0808   .0793   .0778   .0764   .0749   .0735   .0721   .0708   .0694   .0681
1.5    .0668   .0655   .0643   .0630   .0618   .0606   .0594   .0582   .0571   .0559
1.6    .0548   .0537   .0526   .0516   .0505   .0495   .0485   .0475   .0465   .0455
1.7    .0446   .0436   .0427   .0418   .0409   .0401   .0392   .0384   .0375   .0367
1.8    .0359   .0351   .0344   .0336   .0329   .0322   .0314   .0307   .0301   .0294
1.9    .0287   .0281   .0274   .0268   .0262   .0256   .0250   .0244   .0239   .0233
2.0    .0228   .0222   .0217   .0212   .0207   .0202   .0197   .0192   .0188   .0183
2.1    .0179   .0174   .0170   .0166   .0162   .0158   .0154   .0150   .0146   .0143
2.2    .0139   .0136   .0132   .0129   .0125   .0122   .0119   .0116   .0113   .0110
2.3    .0107   .0104   .0102   .00990  .00964  .00939  .00914  .00889  .00866  .00842
2.4    .00820  .00798  .00776  .00755  .00734  .00714  .00695  .00676  .00657  .00639
2.5    .00621  .00604  .00587  .00570  .00554  .00539  .00523  .00508  .00494  .00480
2.6    .00466  .00453  .00440  .00427  .00415  .00402  .00391  .00379  .00368  .00357
2.7    .00347  .00336  .00326  .00317  .00307  .00298  .00289  .00280  .00272  .00264
2.8    .00256  .00248  .00240  .00233  .00226  .00219  .00212  .00205  .00199  .00193
2.9    .00187  .00181  .00175  .00169  .00164  .00159  .00154  .00149  .00144  .00139

z      0        0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9
3      0.00135  0.0³968  0.0³687  0.0³483  0.0³337  0.0³233  0.0³159  0.0³108  0.0⁴723  0.0⁴481
4      .0⁴317   .0⁴207   .0⁴133   .0⁵854   .0⁵541   .0⁵340   .0⁵211   .0⁵130   .0⁶793   .0⁶479
5      .0⁶287   .0⁶170   .0⁷996   .0⁷579   .0⁷333   .0⁷190   .0⁷107   .0⁸599   .0⁸332   .0⁸182
6      .0⁹987   .0⁹530   .0⁹282   .0⁹149   .0¹⁰777  .0¹⁰402  .0¹⁰206  .0¹⁰104  .0¹¹523  .0¹¹260

Solution 9: 

Step 1 — Calculate z. 

$$z = \frac{\text{Mean} - \text{Limit}}{\sigma} = \frac{1575 - 1500}{46} = \frac{75}{46} = 1.63$$

Step 2 — Find the area in one tail of the normal curve at z from 
the mean. From table 5-5 the tail area at z = 1.63 from the mean 
is given as 0.0516. Therefore, there is a 0.0516 probability that 
an observed output will be below 1500 V. 

Step 3 — Find the probability that the output will be 1500 V or 
greater. Since from step 2 P(x < 1500) = 0.0516, 

P(x > 1500) = 1 - P(x < 1500) = 1 - 0.0516 
= 0.9484, or 94.84 percent 

We can therefore expect to obtain a 1500-V output voltage level 
94.84 percent of the time. Or to express it another way, 
94.84 percent of the supplies will produce an output above the 
minimum requirement of 1500 V. This result is shown in 
figure 5-13. 

Figure 5-13. — Exploding bridgewire power supply output. 

Associated with the probability density function p(x) 
of the normal distribution is a cumulative probability distribution 
denoted by F(x). As shown in the integral formulas of 
chapter 2, the relation between the two is given by 

$$F(x) = \int_{-\infty}^{x} p(x)\, dx$$

So, for the normal distribution 

$$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2}\left[(x-\bar{x})/\sigma\right]^{2}}\, dx$$

or in z notation 

$$F(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-z^{2}/2}\, dz$$

Figure 5-14. — Cumulative normal curve. 
A graph of F(x) is shown in figure 5-14. Recall that in 
discussing cumulative functions earlier, F(x) was called the 
cumulative area under the density curve. Looking at figure 5-14, 
then, you can see that 

(1) F(x̄) = 0.5, or that 50 percent of the area under the normal 
distribution is between -∞ and the mean x̄, or that there is a 
50-percent probability that a variable x lies in the interval 
(-∞, x̄) 

(2) 1 - F(x̄) = 0.5, or that 50 percent of the area under the 
normal distribution is between the mean x̄ and ∞, or that 
there is a 50-percent probability that a variable x lies in the 
interval (x̄, ∞) 

(3) The area between -1σ and x̄ is 

P(-1σ < x < x̄) = F(x̄) - F(-1σ) 

= 0.5 - 0.16 = 0.34 

or that there is a 0.34 probability that a variable x will lie 
between the mean x̄ and -1σ. 
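
The cumulative function F can also be evaluated directly rather than read from a curve or table. The sketch below is illustrative only, assuming Python's standard `statistics.NormalDist`; it repeats item (3) above and the one-limit calculation for the exploding bridgewire supply (mean 1575 V, standard deviation 46 V, lower limit 1500 V).

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Item (3): area between one standard deviation below the mean and the mean.
f_mean = std_normal.cdf(0.0)        # F at the mean
f_minus_1 = std_normal.cdf(-1.0)    # F at -1 sigma
area = f_mean - f_minus_1
print(f"F(mean) = {f_mean:.2f}, F(-1 sigma) = {f_minus_1:.2f}, area = {area:.2f}")

# One-limit problem: probability that the supply output is at least 1500 V.
supply = NormalDist(mu=1575.0, sigma=46.0)
p_below = supply.cdf(1500.0)
p_at_least = 1.0 - p_below
print(f"P(x < 1500) = {p_below:.4f}, P(x >= 1500) = {p_at_least:.4f}")
```

Because the code does not round z to 1.63 before the lookup, its result can differ from the tabulated 0.0516 by about one unit in the fourth decimal place.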

For more accurate work, the cumulative areas for selected 
values of z have been tabulated and are shown in tables 5-6 
and 5-7. Table 5-6 shows the cumulative areas for values of z 
from -∞ to 0, which are illustrated in figure 5-15. Table 5-6 
shows that 

(1) At z = 0 (i.e., when the distance from the limit to x̄ is 0), 
the cumulative area from -∞ to x̄ is 0.5000, or 50 percent 

(2) At z = -1.0, the cumulative area from -∞ to -1σ is 
0.1587, or 15.87 percent 

(3) At z = -2.0, the cumulative area from -∞ to -2σ is 
0.02275, or 2.275 percent 

Table 5-7 shows the cumulative areas for values of z from 
0 to ∞, which are illustrated in figure 5-16. 

In both tables the value tabulated for z is the same as F(x). It 
therefore follows that 

(1) The probability of the variable x lying between -∞ 
and x̄ is 

P(-∞ < x < x̄) = F(x̄) - F(-∞) 

= F(z = 0) - F(z = -∞) 

= 0.5 - 0 = 0.5, or 50 percent 

(2) The probability of the variable x lying between -2.1σ 
and 3.2σ is 
TABLE 5-6.— CUMULATIVE NORMAL DISTRIBUTION FROM z = -∞ TO 0 
[From reference 5-2.] 
[A superscript gives the number of zeros following the decimal point: .0²8198 = .008198.] 

z      0        0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09
-0     0.5000   0.4960   0.4920   0.4880   0.4840   0.4801   0.4761   0.4721   0.4681   0.4641
-.1    .4602    .4562    .4522    .4483    .4443    .4404    .4364    .4325    .4286    .4247
-.2    .4207    .4168    .4129    .4090    .4052    .4013    .3974    .3936    .3897    .3859
-.3    .3821    .3783    .3745    .3707    .3669    .3632    .3594    .3557    .3520    .3483
-.4    .3446    .3409    .3372    .3336    .3300    .3264    .3228    .3192    .3156    .3121
-.5    .3085    .3050    .3015    .2981    .2946    .2912    .2877    .2843    .2810    .2776
-.6    .2743    .2709    .2676    .2643    .2611    .2578    .2546    .2514    .2483    .2451
-.7    .2420    .2389    .2358    .2327    .2297    .2266    .2236    .2206    .2177    .2148
-.8    .2119    .2090    .2061    .2033    .2005    .1977    .1949    .1922    .1894    .1867
-.9    .1841    .1814    .1788    .1762    .1736    .1711    .1685    .1660    .1635    .1611
-1.0   .1587    .1562    .1539    .1515    .1492    .1469    .1446    .1423    .1401    .1379
-1.1   .1357    .1335    .1314    .1292    .1271    .1251    .1230    .1210    .1190    .1170
-1.2   .1151    .1131    .1112    .1093    .1075    .1056    .1038    .1020    .1003    .09853
-1.3   .09680   .09510   .09342   .09176   .09012   .08851   .08691   .08534   .08379   .08226
-1.4   .08076   .07927   .07780   .07636   .07493   .07353   .07215   .07078   .06944   .06811
-1.5   .06681   .06552   .06426   .06301   .06178   .06057   .05938   .05821   .05705   .05592
-1.6   .05480   .05370   .05262   .05155   .05050   .04947   .04846   .04746   .04648   .04551
-1.7   .04457   .04363   .04272   .04182   .04093   .04006   .03920   .03836   .03754   .03673
-1.8   .03593   .03515   .03438   .03362   .03288   .03216   .03144   .03074   .03005   .02938
-1.9   .02872   .02807   .02743   .02680   .02619   .02559   .02500   .02442   .02385   .02330
-2.0   .02275   .02222   .02169   .02118   .02068   .02018   .01970   .01923   .01876   .01831
-2.1   .01786   .01743   .01700   .01659   .01618   .01578   .01539   .01500   .01463   .01426
-2.2   .01390   .01355   .01321   .01287   .01255   .01222   .01191   .01160   .01130   .01101
-2.3   .01072   .01044   .01017   .0²9903  .0²9642  .0²9387  .0²9137  .0²8894  .0²8656  .0²8424
-2.4   .0²8198  .0²7976  .0²7760  .0²7549  .0²7344  .0²7143  .0²6947  .0²6756  .0²6569  .0²6387
-2.5   .0²6210  .0²6037  .0²5868  .0²5703  .0²5543  .0²5386  .0²5234  .0²5085  .0²4940  .0²4799
-2.6   .0²4661  .0²4527  .0²4396  .0²4269  .0²4145  .0²4025  .0²3907  .0²3793  .0²3681  .0²3573
-2.7   .0²3467  .0²3364  .0²3264  .0²3167  .0²3072  .0²2980  .0²2890  .0²2803  .0²2718  .0²2635
-2.8   .0²2555  .0²2477  .0²2401  .0²2327  .0²2256  .0²2186  .0²2118  .0²2052  .0²1988  .0²1926
-2.9   .0²1866  .0²1807  .0²1750  .0²1695  .0²1641  .0²1589  .0²1538  .0²1489  .0²1441  .0²1395
-3.0   .0²1350  .0²1306  .0²1264  .0²1223  .0²1183  .0²1144  .0²1107  .0²1070  .0²1035  .0²1001
-3.1   .0³9676  .0³9354  .0³9043  .0³8740  .0³8447  .0³8164  .0³7888  .0³7622  .0³7364  .0³7114
-3.2   .0³6871  .0³6637  .0³6410  .0³6190  .0³5976  .0³5770  .0³5571  .0³5377  .0³5190  .0³5009
-3.3   .0³4834  .0³4665  .0³4501  .0³4342  .0³4189  .0³4041  .0³3897  .0³3758  .0³3624  .0³3495
-3.4   .0³3369  .0³3248  .0³3131  .0³3018  .0³2909  .0³2803  .0³2701  .0³2602  .0³2507  .0³2415
-3.5   .0³2326  .0³2241  .0³2158  .0³2078  .0³2001  .0³1926  .0³1854  .0³1785  .0³1718  .0³1653
-3.6   .0³1591  .0³1531  .0³1473  .0³1417  .0³1363  .0³1311  .0³1261  .0³1213  .0³1166  .0³1121
-3.7   .0³1078  .0³1036  .0⁴9961  .0⁴9574  .0⁴9201  .0⁴8842  .0⁴8496  .0⁴8162  .0⁴7841  .0⁴7532
-3.8   .0⁴7235  .0⁴6948  .0⁴6673  .0⁴6407  .0⁴6152  .0⁴5906  .0⁴5669  .0⁴5442  .0⁴5223  .0⁴5012
-3.9   .0⁴4810  .0⁴4615  .0⁴4427  .0⁴4247  .0⁴4074  .0⁴3908  .0⁴3747  .0⁴3594  .0⁴3446  .0⁴3304
-4.0   .0⁴3167  .0⁴3036  .0⁴2910  .0⁴2789  .0⁴2673  .0⁴2561  .0⁴2454  .0⁴2351  .0⁴2252  .0⁴2157
-4.1   .0⁴2066  .0⁴1978  .0⁴1894  .0⁴1814  .0⁴1737  .0⁴1662  .0⁴1591  .0⁴1523  .0⁴1458  .0⁴1395
-4.2   .0⁴1335  .0⁴1277  .0⁴1222  .0⁴1168  .0⁴1118  .0⁴1069  .0⁴1022  .0⁵9774  .0⁵9345  .0⁵8934
-4.3   .0⁵8540  .0⁵8163  .0⁵7801  .0⁵7455  .0⁵7124  .0⁵6807  .0⁵6503  .0⁵6212  .0⁵5934  .0⁵5668
-4.4   .0⁵5413  .0⁵5169  .0⁵4935  .0⁵4712  .0⁵4498  .0⁵4294  .0⁵4098  .0⁵3911  .0⁵3732  .0⁵3561
-4.5   .0⁵3398  .0⁵3241  .0⁵3092  .0⁵2949  .0⁵2813  .0⁵2682  .0⁵2558  .0⁵2439  .0⁵2325  .0⁵2216
-4.6   .0⁵2112  .0⁵2013  .0⁵1919  .0⁵1828  .0⁵1742  .0⁵1660  .0⁵1581  .0⁵1506  .0⁵1434  .0⁵1366
-4.7   .0⁵1301  .0⁵1239  .0⁵1179  .0⁵1123  .0⁵1069  .0⁵1017  .0⁶9680  .0⁶9211  .0⁶8765  .0⁶8339
-4.8   .0⁶7933  .0⁶7547  .0⁶7178  .0⁶6827  .0⁶6492  .0⁶6173  .0⁶5869  .0⁶5580  .0⁶5304  .0⁶5042
-4.9   .0⁶4792  .0⁶4554  .0⁶4327  .0⁶4111  .0⁶3906  .0⁶3711  .0⁶3525  .0⁶3348  .0⁶3179  .0⁶3019
-∞     0        0        0        0        0        0        0        0        0        0
TABLE 5-7.— CUMULATIVE NORMAL DISTRIBUTION FROM z = 0 TO ∞ 
[From reference 5-2.] 
[A superscript gives the number of nines following the decimal point: .9³3129 = .9993129.] 

z      0        0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09
0      0.5000   0.5040   0.5080   0.5120   0.5160   0.5199   0.5239   0.5279   0.5319   0.5359
.1     .5398    .5438    .5478    .5517    .5557    .5596    .5636    .5675    .5714    .5753
.2     .5793    .5832    .5871    .5910    .5948    .5987    .6026    .6064    .6103    .6141
.3     .6179    .6217    .6255    .6293    .6331    .6368    .6406    .6443    .6480    .6517
.4     .6554    .6591    .6628    .6664    .6700    .6736    .6772    .6808    .6844    .6879
.5     .6915    .6950    .6985    .7019    .7054    .7088    .7123    .7157    .7190    .7224
.6     .7257    .7291    .7324    .7357    .7389    .7422    .7454    .7486    .7517    .7549
.7     .7580    .7611    .7642    .7673    .7703    .7734    .7764    .7794    .7823    .7852
.8     .7881    .7910    .7939    .7967    .7995    .8023    .8051    .8078    .8106    .8133
.9     .8159    .8186    .8212    .8238    .8264    .8289    .8315    .8340    .8365    .8389
1.0    .8413    .8438    .8461    .8485    .8508    .8531    .8554    .8577    .8599    .8621
1.1    .8643    .8665    .8686    .8708    .8729    .8749    .8770    .8790    .8810    .8830
1.2    .8849    .8869    .8888    .8907    .8925    .8944    .8962    .8980    .8997    .90147
1.3    .90320   .90490   .90658   .90824   .90988   .91149   .91309   .91466   .91621   .91774
1.4    .91924   .92073   .92220   .92364   .92507   .92647   .92785   .92922   .93056   .93189
1.5    .93319   .93448   .93574   .93699   .93822   .93943   .94062   .94179   .94295   .94408
1.6    .94520   .94630   .94738   .94845   .94950   .95053   .95154   .95254   .95352   .95449
1.7    .95543   .95637   .95728   .95818   .95907   .95994   .96080   .96164   .96246   .96327
1.8    .96407   .96485   .96562   .96638   .96712   .96784   .96856   .96926   .96995   .97062
1.9    .97128   .97193   .97257   .97320   .97381   .97441   .97500   .97558   .97615   .97670
2.0    .97725   .97778   .97831   .97882   .97932   .97982   .98030   .98077   .98124   .98169
2.1    .98214   .98257   .98300   .98341   .98382   .98422   .98461   .98500   .98537   .98574
2.2    .98610   .98645   .98679   .98713   .98745   .98778   .98809   .98840   .98870   .98899
2.3    .98928   .98956   .98983   .9²0097  .9²0358  .9²0613  .9²0863  .9²1106  .9²1344  .9²1576
2.4    .9²1802  .9²2024  .9²2240  .9²2451  .9²2656  .9²2857  .9²3053  .9²3244  .9²3431  .9²3613
2.5    .9²3790  .9²3963  .9²4132  .9²4297  .9²4457  .9²4614  .9²4766  .9²4915  .9²5060  .9²5201
2.6    .9²5339  .9²5473  .9²5604  .9²5731  .9²5855  .9²5975  .9²6093  .9²6207  .9²6319  .9²6427
2.7    .9²6533  .9²6636  .9²6736  .9²6833  .9²6928  .9²7020  .9²7110  .9²7197  .9²7282  .9²7365
2.8    .9²7445  .9²7523  .9²7599  .9²7673  .9²7744  .9²7814  .9²7882  .9²7948  .9²8012  .9²8074
2.9    .9²8134  .9²8193  .9²8250  .9²8305  .9²8359  .9²8411  .9²8462  .9²8511  .9²8559  .9²8605
3.0    .9²8650  .9²8694  .9²8736  .9²8777  .9²8817  .9²8856  .9²8893  .9²8930  .9²8965  .9²8999
3.1    .9³0324  .9³0646  .9³0957  .9³1260  .9³1553  .9³1836  .9³2112  .9³2378  .9³2636  .9³2886
3.2    .9³3129  .9³3363  .9³3590  .9³3810  .9³4024  .9³4230  .9³4429  .9³4623  .9³4810  .9³4991
3.3    .9³5166  .9³5335  .9³5499  .9³5658  .9³5811  .9³5959  .9³6103  .9³6242  .9³6376  .9³6505
3.4    .9³6631  .9³6752  .9³6869  .9³6982  .9³7091  .9³7197  .9³7299  .9³7398  .9³7493  .9³7585
3.5    .9³7674  .9³7759  .9³7842  .9³7922  .9³7999  .9³8074  .9³8146  .9³8215  .9³8282  .9³8347
3.6    .9³8409  .9³8469  .9³8527  .9³8583  .9³8637  .9³8689  .9³8739  .9³8787  .9³8834  .9³8879
3.7    .9³8922  .9³8964  .9⁴0039  .9⁴0426  .9⁴0799  .9⁴1158  .9⁴1504  .9⁴1838  .9⁴2159  .9⁴2468
3.8    .9⁴2765  .9⁴3052  .9⁴3327  .9⁴3593  .9⁴3848  .9⁴4094  .9⁴4331  .9⁴4558  .9⁴4777  .9⁴4988
3.9    .9⁴5190  .9⁴5385  .9⁴5573  .9⁴5753  .9⁴5926  .9⁴6092  .9⁴6253  .9⁴6406  .9⁴6554  .9⁴6696
4.0    .9⁴6833  .9⁴6964  .9⁴7090  .9⁴7211  .9⁴7327  .9⁴7439  .9⁴7546  .9⁴7649  .9⁴7748  .9⁴7843
4.1    .9⁴7934  .9⁴8022  .9⁴8106  .9⁴8186  .9⁴8263  .9⁴8338  .9⁴8409  .9⁴8477  .9⁴8542  .9⁴8605
4.2    .9⁴8665  .9⁴8723  .9⁴8778  .9⁴8832  .9⁴8882  .9⁴8931  .9⁴8978  .9⁵0226  .9⁵0655  .9⁵1066
4.3    .9⁵1460  .9⁵1837  .9⁵2199  .9⁵2545  .9⁵2876  .9⁵3193  .9⁵3497  .9⁵3788  .9⁵4066  .9⁵4332
4.4    .9⁵4587  .9⁵4831  .9⁵5065  .9⁵5288  .9⁵5502  .9⁵5706  .9⁵5902  .9⁵6089  .9⁵6268  .9⁵6439
4.5    .9⁵6602  .9⁵6759  .9⁵6908  .9⁵7051  .9⁵7187  .9⁵7318  .9⁵7442  .9⁵7561  .9⁵7675  .9⁵7784
4.6    .9⁵7888  .9⁵7987  .9⁵8081  .9⁵8172  .9⁵8258  .9⁵8340  .9⁵8419  .9⁵8494  .9⁵8566  .9⁵8634
4.7    .9⁵8699  .9⁵8761  .9⁵8821  .9⁵8877  .9⁵8931  .9⁵8983  .9⁶0320  .9⁶0789  .9⁶1235  .9⁶1661
4.8    .9⁶2067  .9⁶2453  .9⁶2822  .9⁶3173  .9⁶3508  .9⁶3827  .9⁶4131  .9⁶4420  .9⁶4696  .9⁶4958
4.9    .9⁶5208  .9⁶5446  .9⁶5673  .9⁶5889  .9⁶6094  .9⁶6289  .9⁶6475  .9⁶6652  .9⁶6821  .9⁶6981
∞      1.0      1.0      1.0      1.0      1.0      1.0      1.0      1.0      1.0      1.0
Figure 5-15.— Cumulative areas for values of z from -∞ to 0. 

Figure 5-16.— Cumulative areas for values of z from 0 to ∞. 


P(-2.1σ < x < 3.2σ) = F(3.2) - F(-2.1) 

= F(z = 3.2) - F(z = -2.1) 

= 0.9993129 - 0.01786 
= 0.9814529, or 98 percent 

Nonsymmetrical Two-Limit Problems 

The cumulative function is useful for solving nonsymmetrical 
two-limit problems, which are in practice the most frequently 
encountered. 

Example 10: Suppose that a time-delay relay is required to 
delay the transmission of a signal at least 90 sec but no more 
than 98 sec. If the mean “time out” of the specific type of 
relay is 95 sec and the standard deviation is 2.2 sec, what is 
the probability that the signal will be delayed within the 
specified times? 

Solution 10: 

Step 1 — Find F(98 sec). Since the mean is given as 95 sec and 
the standard deviation as 2.2 sec, 

$$z = \frac{\text{Limit} - \text{Mean}}{\sigma} = \frac{98 - 95}{2.2} = \frac{3}{2.2} = 1.36$$

From table 5-7, 

F(98 sec) = F(z) = F(1.36) = 0.91309 

Step 2 — Find F(90 sec). Since the mean is 95 sec and the 
standard deviation is 2.2 sec, 

$$z = \frac{90 - 95}{2.2} = \frac{-5}{2.2} = -2.27$$

From table 5-6, 

F(90 sec) = F(z) = F(-2.27) = 0.01160 

Step 3 — Find P(90 < x < 98). From steps 1 and 2, 

P(90 < x < 98) = F(98) - F(90) = 0.91309 - 0.01160 
= 0.90149, or 90 percent 

There exists, therefore, a 90-percent probability that the signal 
will be delayed no less than 90 sec and no more than 98 sec, as 
shown in figure 5-17. 

Figure 5-17.— Signal delay time. 


Application of Normal Distribution to Test Analyses and 
Reliability Predictions 

This section gives two examples of how the normal distribution 
techniques may be applied to the analysis of test data of 
certain devices and how the results of the analysis may be used 
to estimate or predict the outcome of actual tests (ref. 5-5). 
Many similar examples are given in chapter 6. 
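
The nonsymmetrical two-limit calculations discussed for the cumulative function, including the time-delay relay of example 10, can be sketched numerically. This is an illustrative check, assuming Python's standard `statistics.NormalDist` in place of tables 5-6 and 5-7; the variable names are not from the text, and z is rounded to two places to mimic a table lookup.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# P(-2.1 sigma < x < 3.2 sigma) = F(3.2) - F(-2.1)
p_general = std_normal.cdf(3.2) - std_normal.cdf(-2.1)
print(f"P(-2.1 sigma < x < 3.2 sigma) = {p_general:.4f}")

# Example 10: relay time-out with mean 95 sec and standard deviation 2.2 sec,
# required to fall between 90 sec and 98 sec.
mean, sigma = 95.0, 2.2
z_upper = round((98.0 - mean) / sigma, 2)
z_lower = round((90.0 - mean) / sigma, 2)
p_delay = std_normal.cdf(z_upper) - std_normal.cdf(z_lower)
print(f"z_upper = {z_upper}, z_lower = {z_lower}, P(90 < x < 98) = {p_delay:.4f}")
```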

Example 11: For this two-limit problem, assume that a door 
hinge has a pin pull-force requirement of 12 ± 4.64 lb. Assume 
further that we have received 116 door hinges and have actually 
measured the pin pull-force required for 16 of them as part 
of an acceptance test. The results of the test are shown in table 
5-8 and in the histogram of figure 5-18. We now want to apply 
normal distribution theory and then estimate what percentage 
of the remaining 100 door hinges will meet the pin pull-force 
requirement. 

Solution 11: 

Step 1— Solve for the mean of the test data x. We have already 
seen that 
TABLE 5-8.— RESULTS OF DOOR HINGE ACCEPTANCE TEST 

Pull-force required, lb    Number of occurrences 
 8                          1 
10                          3 
12                          7 
14                          4 
16                          1 
Total                      16 


Figure 5-18.— Door hinge test results. 


$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where 

x_i  value of ith measurement 
n    total number of measurements 

Let x = pound forces so that 

x_1 = 8     x_9  = 12 
x_2 = 10    x_10 = 12 
x_3 = 10    x_11 = 12 
x_4 = 10    x_12 = 14 
x_5 = 12    x_13 = 14 
x_6 = 12    x_14 = 14 
x_7 = 12    x_15 = 14 
x_8 = 12    x_16 = 16 

and let n = 16 (number of occurrences). The mean x̄ is 
therefore 

$$\bar{x} = \frac{8 + 3(10) + 7(12) + 4(14) + 16}{16} = 12\ \text{lb (rounded to two places)}$$

Step 2 — Solve for the standard deviation σ. We have also seen 
that 

$$\sigma = \left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n - 1} \right]^{1/2}$$

where 

x̄    observed mean 
x_i  value of ith measurement 
n    total number of measurements 

Solve for the sum of squared deviations: 

$$\sum_{i=1}^{16} (x_i - 12)^{2} = (8 - 12)^{2} + 3(10 - 12)^{2} + 7(12 - 12)^{2} + 4(14 - 12)^{2} + (16 - 12)^{2}$$

$$= (-4)^{2} + 3(-2)^{2} + 7(0)^{2} + 4(2)^{2} + (4)^{2} = 16 + 12 + 0 + 16 + 16 = 60$$

Then solve for 

$$\frac{\sum_{i=1}^{16} (x_i - 12)^{2}}{n - 1} = \frac{60}{16 - 1} = \frac{60}{15} = 4$$

Finally solve for σ: 

$$\sigma = \left[ \frac{\sum_{i=1}^{16} (x_i - 12)^{2}}{n - 1} \right]^{1/2} = \sqrt{4} = 2\ \text{lb}$$
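
The mean and standard deviation of the door hinge data, and the tolerance calculation of step 3, can be sketched from the table 5-8 frequencies. This is a minimal illustration, assuming Python's standard `statistics.NormalDist` in place of tables 5-6 and 5-7; the dictionary `results` and the other names are chosen for this example only.

```python
import math
from statistics import NormalDist

# Table 5-8: measured pin pull-force (lb) -> number of occurrences
results = {8: 1, 10: 3, 12: 7, 14: 4, 16: 1}

n = sum(results.values())
mean_exact = sum(x * count for x, count in results.items()) / n
mean = round(mean_exact)  # the text rounds the mean to 12 lb

# Sample standard deviation with n - 1 in the denominator
sum_sq = sum(count * (x - mean) ** 2 for x, count in results.items())
sigma = math.sqrt(sum_sq / (n - 1))

# Tolerance of 12 +/- 4.64 lb expressed in standard deviations
z = 4.64 / sigma
std_normal = NormalDist()
within = std_normal.cdf(z) - std_normal.cdf(-z)

print(f"n = {n}, mean = {mean} lb, sum of squares = {sum_sq}")
print(f"sigma = {sigma:.2f} lb, z = {z:.2f}, fraction within tolerance = {within:.4f}")
```

The unrounded mean is slightly above 12 lb; the sketch follows the text in rounding it before computing the deviations, which is why the sum of squares comes out to exactly 60.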


Step 3 — With a mean of x̄ = 12 lb and a standard deviation of 
σ = 2 lb, figure 5-18 shows that 

(1) The lower pull-force limit of 7.36 lb is z = 
(7.36 - 12)/2 = -2.32 standard deviations from the mean. 

(2) The upper limit of 16.64 lb is z = (16.64 - 12)/2 = 
2.32 standard deviations from the mean. 

Consequently, the percentage of door hinges that should fall 
within the 12 ± 4.64-lb tolerance is given by 

P(-2.32σ < x < 2.32σ) = F(2.32) - F(-2.32) 

= 0.98983 - 0.01017 
(from tables 5-6 and 5-7) 

= 0.97966, or 98 percent 

This says that 98 percent of the door hinges should fall within 
the 12 ± 4.64-lb tolerance and that 2 percent should be outside 
the required tolerance. However, none of the 16 samples was 
outside the tolerance. So where are the 2 percent that the 
analysis says are defective? The answer is that the 2 percent of 
defective door hinges are in the 100 not tested. 

We can make this statement by assuming that if we had tested 
all 100 door hinges, we would have expected to observe the 
same mean (x̄ = 12 lb) and standard deviation (σ = 2 lb) that 
we did with the 16 samples. Note that this assumption is subject 
to confidence limits discussed in chapter 6. If we accept this 
assumption, we would expect to find 2 of the 100 door hinges 
defective: one would have a pull force less than 7.36 lb (the 
lower limit) and the other, a pull force greater than 16.64 lb (the 
upper limit). This is also shown in figure 5-18. 

However, considering the 16 door hinges to be actually 
representative of all such door hinges, we could predict that 
only 98 percent of such door hinges produced would meet the 
acceptance criteria of a 12 ± 4.64-lb pin pull force. 

Example 12: In this one-limit problem, 10 power supplies are 
selected out of a lot of 110 and tested at increasing temperatures 
until all exceed a maximum permissible output of 31 V. The 
failure temperatures in degrees centigrade of the 10 supplies 
are observed to be 

x_1 = 57    x_6  = 60 
x_2 = 65    x_7  = 75 
x_3 = 53    x_8  = 82 
x_4 = 62    x_9  = 71 
x_5 = 66    x_10 = 69 

Find the probability that the remaining 100 supplies will have 
an output greater than 31 V at 50 °C and below. 

Figure 5-19.— Failure distribution of power supplies. 

Solution 12 : 


Step 1: Solve for the mean $\bar{x}$:

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{n} = \frac{57 + 65 + 53 + 62 + 66 + 60 + 75 + 82 + 71 + 69}{10} = \frac{660}{10} = 66\ ^\circ\text{C}$$

Step 2: Solve for the standard deviation σ. First,

$$\sum_{i=1}^{10}(x_i - 66)^2 = (57-66)^2 + (65-66)^2 + (53-66)^2 + (62-66)^2 + (66-66)^2$$

$$+ (60-66)^2 + (75-66)^2 + (82-66)^2 + (71-66)^2 + (69-66)^2$$

$$= 81 + 1 + 169 + 16 + 0 + 36 + 81 + 256 + 25 + 9 = 674$$

Then

$$\sigma = \left[\frac{\sum_{i=1}^{10}(x_i - 66)^2}{n-1}\right]^{1/2} = \left(\frac{674}{9}\right)^{1/2} = 8.7\ ^\circ\text{C (rounded to two places)}$$

Step 3: Solve for z = (Limit - Mean)/σ. With an observed
mean $\bar{x}$ = 66 and a standard deviation σ = 8.7, the 50 °C
limit is z = (50 - 66)/8.7 = -16/8.7 = -1.84 standard
deviations from the mean.

Step 4: Look at table 5-6 and find the cumulative area from
-∞ to z = -1.84. This is given as 0.03288. Therefore, there
is a 3.288-percent probability that the remaining 100 supplies
will have an output greater than 31 V at 50 °C and below.
This is shown in figure 5-19.
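The four steps of this one-limit problem can be reproduced in a few lines. This is a minimal Python sketch, assuming the cumulative normal area is computed from the error function rather than read from table 5-6; the helper name `normal_cdf` and the variable names are illustrative only.

```python
import math
import statistics

def normal_cdf(z: float) -> float:
    """Cumulative area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Failure temperatures (deg C) of the 10 tested power supplies.
failure_temps = [57, 65, 53, 62, 66, 60, 75, 82, 71, 69]
limit = 50.0  # temperature limit of interest, deg C

# Step 1: observed mean.
mean = statistics.mean(failure_temps)

# Step 2: sample standard deviation (n - 1 divisor), rounded as in the text.
sigma = round(statistics.stdev(failure_temps), 1)

# Step 3: distance of the limit from the mean in standard deviations,
# rounded to two places to mimic a lookup in a printed normal table.
z = round((limit - mean) / sigma, 2)

# Step 4: cumulative area from -infinity to z.
p_fail = normal_cdf(z)

print(mean, sigma, z, round(p_fail, 5))
```

The intermediate rounding is kept only so that the result lines up with the tabulated value; without it the probability differs in the third significant figure.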

Effects of Tolerance on a Product 

Because tolerances must be anticipated in all manufacturing 
processes, some important questions to ask about the effects of 
tolerance on a product are 

(1) How is the reliability affected?

(2) How can tolerances be analyzed and what methods are
available?

(3) How are tolerance failures affected?

Electrical circuits are often affected by part tolerances
(circuit gains can shift up or down, and transfer function poles
or zeros can shift into the right-hand s-plane, causing
oscillations). Mechanical components may not fit together or
may be so loose that excessive vibration causes failure
(refs. 5-6 to 5-8).


Notes on Tolerance Accumulation: A How-To-Do-It Guide

General. The notation used in calculating tolerance is

T   tolerance
σ_V   standard deviation
V   dependent variable subject to tolerance accumulation
x   independent, measurable parameter
1, 2, 3, n   subscript notation for parameters
i   generalized subscript (i.e., i = 1, 2, 3, ..., n for x_i)

Tolerance is usually ±3σ. When in doubt, find out. Note
that when T is expressed in percent, always convert to
engineering units before proceeding. The mean or average is
$\bar{V} = f(\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_n)$. The coefficient of variation is

$$C_V = (\sigma_V / \bar{V}) \times 100 = \text{percent}$$

Worst-case method. The worst-case method is as follows:

$$+V = f[(\bar{x}_1 + T_1), (\bar{x}_2 + T_2), (\bar{x}_3 + T_3), \ldots, (\bar{x}_n + T_n)]$$

$$-V = f[(\bar{x}_1 - T_1), (\bar{x}_2 - T_2), (\bar{x}_3 - T_3), \ldots, (\bar{x}_n - T_n)]$$

Actually,

$$\pm V = f[(\bar{x}_1 \pm T_1), (\bar{x}_2 \pm T_2), (\bar{x}_3 \pm T_3), \ldots, (\bar{x}_n \pm T_n)]$$

where the plus or minus sign is selected for maximum V and
then selected to give minimum V. If these ±V worst-case limits
are acceptable, go no farther. If not, try the root-sum-square
method.

Root-sum-square method. The root-sum-square method is
valid only if the f(x's) are algebraically additive (i.e., when V
is a linear function of the x's):

$$\pm V = \bar{V} \pm 3\sigma_V$$

where

$$\sigma_V^2 = \sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \cdots + \sigma_n^2$$

and

$$\sigma_i = \frac{T_i}{3} \quad \text{if } T_i = \pm 3\sigma_i$$

Stated another way,

$$\pm V = \bar{V} \pm \left(T_1^2 + T_2^2 + T_3^2 + \cdots + T_n^2\right)^{1/2}$$

If these ±V root-sum-square limits are acceptable, go no
farther. If they are not acceptable or the f(x's) involve products
or quotients, try the perturbation or partial derivative methods.

Perturbation method. The perturbation method is as follows:

$$\pm V = \bar{V} \pm 3\sigma_V$$

where

$$\sigma_V = \left[\sum_{i=1}^{n}(V_i - \bar{V})^2\right]^{1/2}$$

and where

$$V_i = f(\bar{x}_1, \ldots, \bar{x}_i \pm \sigma_i, \ldots, \bar{x}_n)$$

that is, each V_i is evaluated with only the ith parameter
shifted by its standard deviation. The ±V limits are valid if
C_V = (σ_V / V̄) × 100 ≤ 10 percent.

Partial derivative method. The partial derivative method
is as follows:

$$\pm V = \bar{V} \pm 3\sigma_V$$

where

$$\sigma_V = \left[\sum_{i=1}^{n}\left(\frac{\partial V}{\partial x_i}\right)^2 \sigma_i^2\right]^{1/2}$$

The ±V limits are valid if C_V = (σ_V / V̄) × 100 ≤ 10 percent.

Thus, four methods are available for estimating the effects of
tolerance on a product. The worst-case method can be used on
any problem. In those cases where the ±V worst-case limits are
not acceptable, other methods can be tried. The root-sum-square
method is usually valid if the functions are algebraically
additive. The perturbation or partial derivative methods are
valid only if the coefficient of variation is less than or equal to
10 percent.

Estimating Effects of Tolerance

The following examples illustrate how these tolerance
equations can be used. Consider a stacked tolerance problem where
the dependent variable is a linear function, with three variables
added to give V:

$$V = f(x_1, x_2, x_3)$$

$$V = x_1 + x_2 + x_3$$

where

x_1 = 1 ± 0.1 mil
x_2 = 2 ± 0.1 mil
x_3 = 3 ± 0.1 mil
T = 3σ

Now, find V̄ and the expected range of V:

$$\bar{V} = 1 + 2 + 3 = 6 \text{ mils}$$

Using the worst-case method, with positive tolerance

$$V_+ = (1 + 0.1) + (2 + 0.1) + (3 + 0.1) = 6.3 \text{ mils}$$

and with negative tolerance

$$V_- = (1 - 0.1) + (2 - 0.1) + (3 - 0.1) = 5.7 \text{ mils}$$

or

$$V_\pm = 6 \pm 0.3 \text{ mil}$$

In the worst-case method, the tolerance on V (i.e., 0.3 mil) is
worse than the 3σ_V tolerance. Tolerance can and often does
cause fit problems and circuit problems. Therefore, in some
cases we need to know what tolerance is acceptable.

Using the root-sum-square method,

$$\bar{V} = 6 \text{ mils}$$

and

$$\sigma_1 = \frac{0.1}{3} = 0.033 = \sigma_2 = \sigma_3$$

$$\sigma_V = (\sigma_1^2 + \sigma_2^2 + \sigma_3^2)^{1/2} = (3\sigma_1^2)^{1/2} = [3(0.033)^2]^{1/2} = 0.0572$$

$$3\sigma_V = 0.172$$

so that

$$V_\pm = 6 \pm 0.172 \text{ mil}$$

In the root-sum-square method, the T value of 0.172 is the 3σ
tolerance on V.
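The stacked-tolerance example can be checked with a short script. This is a minimal Python sketch, assuming a purely additive stack and T_i = 3σ_i for every part; the function names `worst_case` and `root_sum_square` are hypothetical helpers, not part of the original text.

```python
import math

def worst_case(nominals, tolerances):
    """Return (min, max) of a sum when every part sits at its tolerance limit."""
    total = sum(nominals)
    spread = sum(tolerances)
    return total - spread, total + spread

def root_sum_square(nominals, tolerances):
    """Return (nominal, 3-sigma tolerance) of a sum, assuming T_i = 3 * sigma_i."""
    sigmas = [t / 3.0 for t in tolerances]
    sigma_v = math.sqrt(sum(s ** 2 for s in sigmas))
    return sum(nominals), 3.0 * sigma_v

nominals = [1.0, 2.0, 3.0]    # mils
tolerances = [0.1, 0.1, 0.1]  # mils

v_min, v_max = worst_case(nominals, tolerances)
v_nom, t_rss = root_sum_square(nominals, tolerances)

print(f"worst case: {v_min:.2f} to {v_max:.2f} mils")
print(f"root-sum-square: {v_nom:.2f} +/- {t_rss:.3f} mil")
```

Carrying full precision gives a root-sum-square tolerance of about 0.173 mil; the 0.172 in the worked example comes from rounding each σ to 0.033 before combining.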
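As a second example, consider a volume problem that has
three variables in multiplication. Find V̄ and the expected
range of V:

$$\bar{V} = LWH = 10 \text{ ft} \times 5 \text{ ft} \times 2 \text{ ft} = 100 \text{ ft}^3$$

First, convert percent tolerances to engineering units:

L = 10 ft ± 10 percent = 10 ft ± 10 ft × 0.1 = 10 ft ± 1 ft
W = 5 ft ± 10 percent = 5 ft ± 5 ft × 0.1 = 5 ft ± 0.5 ft
H = 2 ft ± 5 percent = 2 ft ± 2 ft × 0.05 = 2 ft ± 0.1 ft
T = ±3σ

Using the worst-case method,

$$V_\pm = (10 \pm 1) \times (5 \pm 0.5) \times (2 \pm 0.1) = 11 \times 5.5 \times 2.1 \text{ or } 9 \times 4.5 \times 1.9 = 127 \text{ or } 77$$

The root-sum-square method cannot be used because these
variables are not algebraically additive. Using the perturbation
method,

$$V_\pm = \bar{V} \pm 3\sigma_V$$

where

$$\sigma_V = \left\{[(L + \sigma_L)WH - \bar{V}]^2 + [(W + \sigma_W)LH - \bar{V}]^2 + [(H + \sigma_H)LW - \bar{V}]^2\right\}^{1/2}$$

and

$$\sigma_L = \frac{T_L}{3} = \frac{1}{3} = 0.33 \text{ ft}, \quad \sigma_W = \frac{T_W}{3} = \frac{0.5}{3} = 0.17 \text{ ft}, \quad \sigma_H = \frac{T_H}{3} = \frac{0.1}{3} = 0.03 \text{ ft}$$

so that

$$\sigma_V = \left\{[(10 + 0.33)(5)(2) - 100]^2 + [(5 + 0.17)(10)(2) - 100]^2 + [(2 + 0.03)(10)(5) - 100]^2\right\}^{1/2}$$

$$= [(103.3 - 100)^2 + (103.4 - 100)^2 + (101.5 - 100)^2]^{1/2}$$

$$= (10.89 + 11.56 + 2.25)^{1/2} = \sqrt{25} = 5$$

$$V_\pm = \bar{V} \pm 3\sigma_V = 100 \pm 15 \text{ ft}^3$$

Checking the validity gives

$$C_V = \frac{\sigma_V}{\bar{V}} \times 100 = \frac{5}{100} \times 100 = 5 \text{ percent}$$

which is less than 10 percent. This solution is a better estimate
of the effects of tolerance on volume. Note also that various
values can now be estimated for different types of problems
regarding this volume because it has been represented as a
normal distribution function.

Using the partial derivative method, again

$$V_\pm = \bar{V} \pm 3\sigma_V$$

where

$$\sigma_V = \left[\left(\frac{\partial V}{\partial L}\right)^2 \sigma_L^2 + \left(\frac{\partial V}{\partial W}\right)^2 \sigma_W^2 + \left(\frac{\partial V}{\partial H}\right)^2 \sigma_H^2\right]^{1/2}$$

$$V = LWH, \quad \frac{\partial V}{\partial L} = WH, \quad \frac{\partial V}{\partial W} = LH, \quad \frac{\partial V}{\partial H} = LW$$

$$\sigma_L = 0.33 \text{ ft}, \quad \sigma_W = 0.17 \text{ ft}, \quad \sigma_H = 0.03 \text{ ft}$$

$$\sigma_V = [(WH)^2 \sigma_L^2 + (LH)^2 \sigma_W^2 + (LW)^2 \sigma_H^2]^{1/2}$$

$$= [(5 \times 2)^2 (0.33)^2 + (10 \times 2)^2 (0.17)^2 + (10 \times 5)^2 (0.03)^2]^{1/2}$$

$$= (10.9 + 11.6 + 2.25)^{1/2} = \sqrt{25} = 5$$

$$V = 100 \pm 15 \text{ ft}^3$$

This method is more work and gives the same results as the
perturbation method. Because C_V = 5 percent, which is less
than 10 percent, the method would be suitable to use.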
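The three calculations for the volume example can be compared side by side. This is a minimal Python sketch, assuming T = 3σ for each dimension and positive dimensions (so that the extreme volumes occur with all dimensions at the same limit); the dictionary keys and helper names are illustrative only.

```python
import math

def volume(length, width, height):
    return length * width * height

nominal = {"length": 10.0, "width": 5.0, "height": 2.0}   # ft
tolerance = {"length": 1.0, "width": 0.5, "height": 0.1}  # ft, taken as 3 sigma
sigma = {k: t / 3.0 for k, t in tolerance.items()}

v_nom = volume(**nominal)

# Worst case: every dimension at its upper limit, then at its lower limit.
v_max = volume(**{k: nominal[k] + tolerance[k] for k in nominal})
v_min = volume(**{k: nominal[k] - tolerance[k] for k in nominal})

# Perturbation method: shift one dimension at a time by one sigma.
def perturbed(name):
    shifted = dict(nominal)
    shifted[name] += sigma[name]
    return volume(**shifted)

sigma_pert = math.sqrt(sum((perturbed(k) - v_nom) ** 2 for k in nominal))

# Partial derivative method: dV/dL = WH, dV/dW = LH, dV/dH = LW.
partials = {
    "length": nominal["width"] * nominal["height"],
    "width": nominal["length"] * nominal["height"],
    "height": nominal["length"] * nominal["width"],
}
sigma_pd = math.sqrt(sum((partials[k] * sigma[k]) ** 2 for k in nominal))

# Validity check: coefficient of variation in percent.
cv_percent = 100.0 * sigma_pert / v_nom

print(f"worst case: {v_min:.2f} to {v_max:.2f} ft^3")
print(f"perturbation sigma_V: {sigma_pert:.3f}, partial derivative sigma_V: {sigma_pd:.3f}")
print(f"C_V: {cv_percent:.2f} percent")
```

For a pure product such as V = LWH the two methods agree exactly, because shifting one factor changes V linearly in that factor; for other functions the two estimates can differ slightly.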


Concluding Remarks 


Now that you have completed chapter 5, you should have a
clear understanding of the following concepts:

(1) A probability density function p(x) for a random variable
describes the probability that the variable will take on a certain
range of values.

(2) The area under the density function is equal to unity,
which means that the probability is 1 that the variable will be
within the interval described by the density function. For
example, the normal distribution describes the interval from
-∞ to ∞.

(3) Associated with each probability density function is a
cumulative probability distribution F(x) that represents the
cumulative sum of the areas under the density function.

(4) The normal distribution (also called the bell curve, the
Gaussian distribution, and the normal curve of error) is a
probability density function. Using the normal distribution,
you should be able to solve the following types of problems:

(a) Symmetrical two-limit problems, which are concerned with
the probability of a variable taking on values within equal
distances from both sides of the mean.

(b) Nonsymmetrical two-limit problems, which are similar to
(a) but within unequal distances from both sides of the
mean of the density function.

(c) One-limit problems, which are concerned with the probability
of a variable taking on values above or below some limit
represented by some distance from the mean of the density
function.

(5) You should be able to take data measurements of a certain
device and calculate the mean of the data given by

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

and the standard deviation of the data given by

$$\sigma = \left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\right]^{1/2}$$

and

$$z = \frac{x_i - \bar{x}}{\sigma}$$

Using the data mean and standard deviation, you should then be
able to estimate the probability of failures occurring when more
of the same devices are tested or operated.

(6) The worst-case method can be used on any problem:

(a) Limits will be defined.

(b) No estimates can be made from the population
distribution.

(7) The root-sum-square method only applies to algebraic
variables that are additive.

(8) The perturbation or partial derivative methods are only
valid if the coefficient of variation is 10 percent or less.

Reliability Training¹

1. A unit is required to operate at 100 °F. If tests show the mean strength of the data for the unit is 123 °F and the standard
deviation is 9 °F, what is the probability that the unit will operate successfully; that is, P(x > 100 °F)?

A. 0.5234 B. 0.2523 C. 0.9946 D. 0.9995

2. A pressure vessel (including a factor of safety) has an upper operating limit of 8000 psi. Burst tests show a mean
strength of 9850 psi and a standard deviation of 440 psi. What is the probability of pressure vessel failure; that is,
P(x < 8000 psi)?²

A. 0.0⁴267 B. 0.0⁴133 C. 0.0[illegible]

3. A memory drum is required to reach sink speed and stabilize in 15.5 sec at 125 °F. Five drums are tested with these 

stabilizing time results: 13.2, 12.3, 14.8, 10.3, and 12.9 sec. 

a. What is the mean stabilizing time? 

A. 13.1 B. 10.7 C. 12.7 

b. What is the standard deviation? 

A. 1.63 B. 1.45 C. 1.32 

c. What is the estimated percentage of drums out of specification; that is, P(x > 15.5 sec)?

A. 6.7 B. 8.5 C. 4.3

4. A pyrotechnic gyro has an uncaging time requirement of 142 ± 20 msec. Six gyros were tested resulting in these
uncaging times: 123, 153, 140, 129, 132, and 146 msec.

a. What is the mean uncaging time?

A. 133.2 msec B. 135.2 msec C. 137.2 msec

b. What is the standard deviation?

A. 10.2 B. 11.2 C. 11.9

c. What is the estimated percentage of gyros within specification; that is, P(122 < x < 162 msec)?

A. 89.8 B. 96.8 C. 82.6

5. A hydraulic pressure line was designed to the following stresses: 

(a) Maximum operating pressure (actual), 1500 psi 

(b) Design pressure (10-percent safety factor), 1650 psi 

Tests of the pressure line indicated a mean failure pressure of 1725 psi and a standard deviation of 45 psi. 

a. What is the reliability of the line when the design pressure limits are considered? 

A. 0.10 B. 0.90 C. 0.95

b. What is the reliability of the line when the maximum operating pressure is considered?

A. 0.99 B. 0.90 C. 0.80

¹Answers are given at the end of this manual.

²The superscripted numbers in the answers are shorthand for 2.67×10⁻⁶.

6. A communications network requires a 1300-msec watchdog delay after initiation. A sample of 10 delays was tested
from a rack of 100 delays. The time delays of the circuits are as shown:

Circuit number    Delay, msec
      1              1250
      2              1400
      3              1700
      4              1435
      5              1100
      6              1565
      7              1485
      8              1385
      9              1350
     10              1400


a. What is the average (mean) delay time? 

A. 1386 msec B. 1400 msec C. 1407 msec 

b. What is the standard deviation? 

A. 52.7 B. 87.1 C. 163.4 

c. On the basis of this sample, what percentage of the 100 circuits will meet specifications (1300-msec or greater 
delay)? 

A. 75 B. 80 C. 90 

7. A circuit contains four elements in series. Their equivalent resistance values are

Element    Nominal resistance, R, ohm    Tolerance,* T, percent
   A                 100                        ±10
   B                  20                        ±1
   C                  10                        ±5
   D                  10                        ±5

*Where ±T = ±3σ.

a. What is the nominal or mean total resistance R_T?

A. 120 Ω B. 140 Ω C. 160 Ω

b. What are the worst-case R values (upper number, maximum; lower number, minimum)?

A. 131.6 Ω    B. 176.3 Ω    C. 151.2 Ω
   118.7 Ω       146.2 Ω       128.8 Ω

c. Using the root-sum-square method, what is the probability that R_T > 135 Ω?

A. 0.905 B. 0.962 C. 0.933

d. Using the perturbation method, what is the probability that R_T > 135 Ω?

A. 0.905 B. 0.962 C. 0.933 

8. Given power (watts) = I²R, where I = 0.5 A, T_I = ±5 percent, R = 100 Ω, and T_R = ±10 percent. (Note: ±T = ±3σ.)

a. What is the nominal or mean power output P?

A. 25 W B. 20 W C. 30 W

b. What are the worst-case P values (upper number, maximum; lower number, minimum)?

A. 26.6 W    B. 35.2 W    C. 30.3 W
   18.2 W       22.6 W       20.3 W

c. Using the perturbation method, what is the probability that 23.5 < P < 26.5?

A. 0.94 B. 0.80 C. 0.86

d. What is the C_V (in percent) for the perturbation method used in question 8c?

A. 12 B. 8 C. 4.6

e. Is the root-sum-square method valid for solving the probability problem 8c?

A. Yes B. No

f. Using the partial derivative method, what is the probability that 23.5 < P < 26.5?

A. 0.942 B. 0.803 C. 0.857

Chapter 6

Testing for Reliability 

In chapters 3 and 4, we discussed the methods used to predict 
the probability that random catastrophic part failures would 
occur in given products and systems. These analytical tech- 
niques are well established (ref. 6-1). Yet, we should keep in 
mind that they are practical only when adequate experimental 
data are available in the form of part failure rates. In other 
words, their validity is predicated on great amounts of empiri- 
cal information. 

Such is not the case when we undertake similar analyses to 
determine the influence of tolerance and wearout failures on the 
reliability of a product. An understanding of these failure 
modes depends on experimental data in the form of probability 
density functions such as those discussed in chapter 5. In 
general, such data are unavailable on items at the part or system 
level; this kind of information must be developed empirically 
through reliability test methods. 

Chapter 6 reviews and expands the terms used in the reliability
expression given in chapter 2 and then shows how the terms
can be demonstrated or assessed through the application of
attribute test, test-to-failure, and life test methods (ref. 6-2).


Demonstrating Reliability 

Recall from chapter 2 that one way to define product reliability
is as the probability that one or more failure modes will not
be manifested (ref. 6-3). This can be written as

$$R = P_c P_t P_w (K_q K_m K_r K_l K_u)$$

where

P_c   probability that catastrophic part failures will not occur
P_t   probability that out-of-tolerance failures will not occur
P_w   probability that wearout failures will not occur
K_q   probability that quality test methods and acceptance
      criteria will not degrade inherent reliability
K_m   probability that manufacturing processes, fabrication, and
      assembly techniques will not degrade inherent reliability
K_r   probability that reliability engineering activities will not
      degrade inherent reliability
K_l   probability that logistics activities will not degrade
      inherent reliability
K_u   probability that user or customer will not degrade inherent
      reliability

The term P_c P_t P_w denotes inherent reliability R_i; (K_q K_m K_r K_l K_u)
are factors that affect the probability of the three modes of
failure occurring during hardware manufacture and use rather
than occurring from unreliable hardware design.

First, we illustrate how the empirical value of these terms
affects product reliability. Then, we discuss the particular test
methods used to develop these values. Assume that a device
was designed with a reliability requirement of 0.996. This
means that only 4 out of 1000 such devices can fail. The device
contains 1000 parts, it has a function to perform within a
tolerance of X ± 2 percent, and it must operate for a mission
cycle of 1000 hours at 50 °C.

P_c Illustrated

If we know the number and types of parts in the device plus
the applied stresses and part failure rates used in the exponential
distribution, $e^{-\lambda t}$, we can estimate the probability that no
catastrophic part failure will occur during the mission cycle.
Assuming, for example, that our estimate is P_c = 0.999 (i.e., one
device in 1000 will incur a catastrophic part failure during the
mission cycle), the product reliability of the device becomes

$$R = P_c P_t P_w (K\text{-factors}) = 0.999\, P_t P_w (K\text{-factors})$$

P_t Illustrated

Suppose that we now test one of the devices at 50 °C. If the
functional output is outside the specified tolerance of
X ± 2 percent, the reliability of that particular device is zero. It
is zero because P_t is zero (i.e., R = (0.999)(0)P_w(K-factors)
= 0). We can say, however, that the device will continue to
operate in an out-of-tolerance condition with a probability of no
catastrophic failures equal to 0.999 just as we predicted. To
understand this, recall that part failure rates reflect only the
electrical, mechanical, and environmental stresses applied to
the individual parts. For this reason, a prediction on the basis of
such data will neglect to indicate that (1) the parts have been
connected to obtain a specified function, (2) a tolerance
analysis of the function has been performed, or (3) the parts are
packaged correctly. In other words, P_c represents only how
well the individual parts will operate, not how well the
combined parts will perform.

If nine more of the devices are tested at 50 °C with all the
output functions remaining within the X ± 2 percent tolerance,
P_t becomes 9/10 = 0.9 and the reliability of the device becomes
R = (0.999)(0.9)P_w(K-factors). Because the reliability
requirement of the device is 0.996, it should be clear that P_t must be
greater than 0.996. Let us assume then that 1000 devices are
tested at 50 °C with only one tolerance failure, which produces
an observed P_t = 999/1000 = 0.999. The reliability of the device
is now

$$R = (0.999)(0.999) P_w (K\text{-factors}) = 0.998\, P_w (K\text{-factors})$$

Note that, because operating time is accumulated during
original functional testing, it is possible for random catastrophic part
failures to occur. Remember, however, that this type of failure
is represented by P_c and not P_t.

P_w Illustrated

Now let us take another operating device and see whether
wearout failures will occur within the 1000-hour mission cycle.
If, as run time is accumulated, a faulty function output or
catastrophic failure is caused by a wear mechanism, the
reliability of the device again becomes zero. It is zero because P_w
is zero as shown in the equation

$$R = (0.999)(0.999)(0)(K\text{-factors}) = 0$$

Note the emphasis on the words “wear mechanism.” Because
it is possible to experience random catastrophic part failures
and even out-of-tolerance conditions during a test for wearout,
it is absolutely necessary to perform physics-of-failure
analyses. This is essential to ascertain if the failures are caused by true
physical wear before including them in the P_w assessment.

So far, the first two terms, P_c and P_t, combine to yield a
probability of (0.999)(0.999) = 0.998. As a result, the
remaining terms, P_w(K-factors), must be no less than 0.998 if the
0.996 device requirement is to be satisfied. Therefore, we
assume that we have demonstrated a P_w of 0.999, which
reduces the device reliability to

$$R = P_c P_t P_w (K\text{-factors}) = (0.999)(0.999)(0.999)(K\text{-factors}) = 0.997\,(K\text{-factors})$$

K-Factors Illustrated

Since testing obviously must be conducted on real hardware,
the K-factors as well as the P terms of reliability are present in
every test sample. Establishing values for the K-factors requires
that all failures observed during a test be subjected to
physics-of-failure analyses to identify specific failure
mechanisms. Actually, the action taken to prevent the recurrence of an
observed failure mechanism determines the factor that caused
the failure. A failure that can be prevented by additional
screening tests as part of the quality acceptance criteria is
charged to the K_q factor; one that requires additional control
over some manufacturing process is charged to the K_m factor,
and so on. Failures that require changes in documentation,
design, and tolerance would be charged to the P_c, P_t, or P_w
terms as applicable.

The least important aspect of testing is the ability to charge
an organization or function with responsibility for a failure.
More important is the need to prevent observed failures from
recurring. This requires that corrective action be made a
recognized part of each reliability test program.

Getting back to the illustration, we assume that one failure
out of 1000 devices was caused by one of the K-factors even
though it could have been observed during a P_c, P_t, or P_w failure
evaluation. This reduces the reliability of the device to

$$R = P_c P_t P_w (K\text{-factors}) = (0.999)(0.999)(0.999)(0.999) = 0.996$$

which indicates that the device met its requirement. 
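The running product in this illustration is easy to verify. This is a minimal Python sketch, assuming the K-factors are supplied as a list of probabilities whose product multiplies the inherent reliability; the function name `product_reliability` is hypothetical.

```python
import math

def product_reliability(p_c, p_t, p_w, k_factors):
    """R = P_c * P_t * P_w * (product of the K-factors)."""
    return p_c * p_t * p_w * math.prod(k_factors)

# Illustration from the text: each P term is 0.999 and the combined
# K-factors account for one failure in 1000 devices (0.999).
r = product_reliability(0.999, 0.999, 0.999, [0.999])

# A single zero term (for example, an out-of-tolerance device) drives R to zero.
r_failed = product_reliability(0.999, 0.0, 0.999, [0.999])

print(round(r, 3), r_failed)
```

Because the terms multiply, every term must individually exceed the overall requirement; four terms of 0.999 only just satisfy a 0.996 requirement.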

Test Objectives and Methods 

The purpose of the preceding illustration was to provide a
better understanding of (1) how the P terms and the K-factors
relate to physical hardware and (2) the techniques for
demonstrating the terms through testing. Table 6-1 shows the
suggested test methods. We say “suggested” because any of the test
methods can be used if certain conditions are met (ref. 6-4).
These conditions are pointed out as each method is discussed.
Table 6-1 indicates the most efficient methods by assigning
priority numbers from 1 to 3 (with 1 being the most efficient and
3 the least).

TABLE 6-1. TEST METHOD PRIORITIES
FOR DEMONSTRATING RELIABILITY

Reliability          Suggested test method
term          Attribute    Tests to    Life
              tests        failure     tests
P_c               2            3         1
P_t               3            1         2
P_w               3            2         1
K-factors         3            1         2

Test Objectives 

At least 1000 test samples (attribute tests) are required to 
demonstrate a reliability requirement of 0.999. Because of cost 
and time, this approach is impractical. Furthermore, the total 
production of a product often may not even approach 1000 
items. Because we usually cannot test the total production of a 
product (called product population), we must demonstrate 
reliability on a few samples. Thus, the main objective of a
reliability test is to test an available device so that the data will 
allow a statistical conclusion to be reached about the reliability 
of similar devices that will not or cannot be tested. That is, the 
main objective of a reliability test is not only to evaluate the 
specific items tested but also to provide a sound basis for 
predicting the reliability of similar items that will not be tested 
and that often have not yet been manufactured. 

As stated, to know how reliable a product is, one must know 
how many ways it can fail and the types and magnitudes of the
stresses that produce such failures. This premise leads to a 
secondary objective of a reliability test: to produce failures in 
the product so that the types and magnitudes of the stresses 
causing such failures can be identified. Reliability tests that 
result in no failures provide some measure of reliability but 
little information about the population failure mechanisms of 
like devices. (The exceptions to this are not dealt with at this 
time.) 

In subsequent sections, we discuss statistical confidence,
attribute test, test-to-failure, and life test methods; explain how
well these methods meet the two test objectives; show how the
test results can be statistically analyzed; and introduce the
subject and use of confidence limits.

Attribute Test Methods 

Qualification, preflight certification, and design verification 
tests are categorized as attribute tests (refs. 6-5 and 6-6). They 
are usually go/no-go and demonstrate that a device is good or 
bad without showing how good or how bad. In a typical test, 


two samples are subjected to a selected level of environmental
stress, usually the maximum anticipated operational limit. If 
both samples pass, the device is considered qualified, preflight 
certified, or verified for use in the particular environment 
involved (refs. 6-7 and 6-8). Occasionally, such tests are called 
tests to success because the true objective is to have the device 
pass the test. 

An attribute test is usually not a satisfactory method of testing 
for reliability because it can only identify gross design and 
manufacturing problems. It can be used for reliability testing 
only when a sufficient number of samples are tested to establish 
an acceptable level of statistical confidence. 


Statistical Confidence 

The statistical confidence level is the probability that the 
corresponding confidence interval covers the true (but unknown) 
value of a population parameter. Such a confidence interval is 
often used as a measure of uncertainty about estimates of 
population parameters. In other words, rather than express 
statistical estimates as point estimates, it is much more mean- 
ingful to express them as a range (or interval), with an associ- 
ated probability (or confidence) that the true value lies within 
such an interval. 

It should be noted, however, that statistical confidence
intervals can be difficult to evaluate (see also refs. 6-4 and 6-9). For
simple distributions in reliability, intervals and levels are 
calculated in a straightforward manner. For more complicated 
or multiparameter distributions, especially where parameter 
estimates are not statistically independent, such intervals and 
levels can be very difficult to calculate. 

To illustrate further the limitations of attribute test methods,
we apply statistics to the test results. Figure A-4(a) in appendix
A shows on the ordinate the number of events (successes)
necessary to demonstrate a reliability value (abscissa) for various
confidence levels (family of curves) when no failures are
observed. Figures A-4(b) to (f) provide the same information
when one to five failures are observed.

From the results of two devices tested with no failures, fig- 
ure A-4(a) shows that we can state with 50-percent confidence 
that the population reliability of such devices is no less than 
71 percent. Fifty-percent confidence means that there is a 
50-percent chance that we are wrong and that the reliability of 
similar untested devices will actually be less than 71 percent. 
Similarly, we can also state from the same figure that we are
60 percent confident that the reliability of all such devices is
no less than 63 percent. But either way, the probability of
success is less than encouraging.

To gain a better understanding of figure A-4 and the theory 
behind it, let us stop for a moment and see how confidence 
levels are calculated. Recall from chapter 2 that the
combination of events that might result from a test of two devices was
given by

$$R^2 + 2RQ + Q^2 = 1$$

where

R²    probability that both devices will pass
2RQ   probability that one device will pass and one will fail
Q²    probability that both devices will fail

In the power supply example, we observed the first event R²
because both supplies passed the test. If we assume a 50-percent
probability that both will pass, we can set R² = 0.50 and solve
for the reliability of the device as follows:

$$R^2 = 0.50$$

$$R = \sqrt{0.50} = 0.71$$

We then can say with 50-percent confidence that the population
reliability of the device is no less than 0.71. By assuming a
50-percent chance, we are willing to accept a 50-percent risk of
being wrong, hence the term “50 percent confident.” If we want
only to take a 40-percent risk of being wrong, we can again
solve for R from

$$R^2 = 0.40$$

$$R = \sqrt{0.40} = 0.63$$

In this case, we can be 60 percent confident that the population
reliability of the devices is no less than 0.63.

Selection of the confidence level is a customer’s or engineer’s
choice and depends on the amount of risk he is willing to take
on being wrong about the reliability of the device. The customer
usually specifies the risk he is willing to take in conjunction
with the system reliability requirement. As higher confidence
levels (lower risk) are chosen, the reliability estimate becomes
lower. For example, if we want to make a 90-percent
confidence (10-percent risk) statement based on the results of the test
to success of two devices, we simply solve

$$R^2 = (1 - \text{Confidence level}) = 1 - 0.90 = 0.10$$

so that

$$R = \sqrt{0.10} = 0.316$$

Table 6-2 illustrates how the reliability lower bound changes
with various confidence levels. The curves in figure A-4 are
developed in a similar manner.

TABLE 6-2. RELIABILITY AND CONFIDENCE
LEVEL FOR TWO-SAMPLE ATTRIBUTE TEST
WITH NO FAILURES

Confidence level,    Reliability,    Risk,
percent              R               percent
      10               0.95            90
      50                .71            50
      60                .63            40
      70                .55            30
      80                .45            20
      90                .32            10
      99                .10             1

In figure A-4(b), which is used
when one failure is observed, for 10 samples tested with one
observed failure, the statistically predicted or demonstrated
reliability at 90-percent confidence is 0.66. This answer is
found by solving

$$R^{10} + 10R^9 Q = 1 - 0.90$$

$$R = 0.663$$

which agrees with the figure to two places.
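The zero-failure and one-failure calculations above follow the same pattern: set the binomial probability of the observed outcome (or a better one) equal to 1 minus the confidence level and solve for R. This is a minimal Python sketch of that idea, assuming a simple bisection search in place of reading the curves of figure A-4; the function names are hypothetical.

```python
import math

def prob_at_most_failures(n, max_failures, reliability):
    """Binomial probability of max_failures or fewer failures in n trials."""
    q = 1.0 - reliability
    return sum(
        math.comb(n, j) * reliability ** (n - j) * q ** j
        for j in range(max_failures + 1)
    )

def reliability_lower_bound(n, failures, confidence, tol=1e-9):
    """Solve P(at most `failures` failures | R) = 1 - confidence for R."""
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_at_most_failures(n, failures, mid) < target:
            lo = mid  # probability too small, so R must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0

# Two samples, no failures, at several confidence levels (compare table 6-2).
for confidence in (0.10, 0.50, 0.60, 0.70, 0.80, 0.90, 0.99):
    bound = reliability_lower_bound(2, 0, confidence)
    print(f"{confidence:.0%} confidence: R >= {bound:.2f}")

# Ten samples with one observed failure at 90-percent confidence.
print(f"{reliability_lower_bound(10, 1, 0.90):.3f}")
```

Bisection works here because the probability of seeing at most a given number of failures rises steadily as R increases from 0 to 1.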

Application — The discussion thus far has underscored the 
shortcomings of attribute tests when sample sizes are small. 
Tests involving only two or three samples may reveal gross 
errors in hardware design or manufacturing processes, but when 
relied upon for anything more, the conclusions become risky 
(refs. 6-7 and 6-8). 

Attribute tests can be useful in testing for reliability when a
sufficient sample size is used. For example, 10 samples tested
without failure statistically demonstrate a population reliability
of 0.79 at 90-percent confidence; 100 tests without failure
demonstrate a population reliability of 0.976 at 90-percent confidence.
To understand better the application of attribute tests
and the use of figure A-4, consider the following examples:

Example 1: During the flight testing of 50 missiles, five
failures are observed. What confidence do we have that the
missile is 80 percent reliable?

Solution 1: From figure A-4(f) the answer is read directly to 
be a 95-percent confidence level. The a posteriori reliability of 
these 50 missiles, or that derived from the observed facts, is still 
45/50 =90 percent. Thus, future flights will be at least 80 percent 
reliable with a 5-percent risk of being wrong. 

Example 2: An explosive switch has a reliability requirement 
of 0.98. How many switches must be fired without a failure to 
demonstrate this reliability at 80-percent confidence? 

Solution 2: From figure A-4(a), the answer is read directly as
80 switches.

Example 3: A test report states that the reliability of a device 
was estimated to be 0.992 at 95-percent confidence based on a 
test of 1000 samples. How many failures were observed? 

Solution 3: In figure A-4(d), the 95-percent confidence curve
crosses the 1000-event line at R = 0.992. Therefore, three
failures were observed.

In these examples, the population reliability estimates may
represent any of the P terms or the K-factors in the expression
for product reliability, depending on the definition of failure
used to judge the test results. For a device that is judged only on
its capability to remain within certain tolerances, the reliability
would be the $P_t$ term. Had catastrophic failures been included,
we would have demonstrated the $P_c$ and $P_t$ terms. In general,
attribute tests include all failure modes as part of the failure
definition and, consequently, the associated reliability is product
reliability with both the P terms and the K-factors included.

Attribute test/safety margin slide rule.— A special-purpose 
slide rule that was developed to facilitate determining attribute 
test/safety margin confidence levels will be available in class 
for these exercises. (See the back of this manual for the slide 
rule and the instructions to assemble it.) 

Examples 4 (confidence level for attribute test): Attribute
tests are tests to success. The objective is for a selected number
of samples, called tests on the slide rule, to operate successfully
at some predetermined stress level. Some tests, however, may
fail. This slide rule handles combinations of up to 1000 tests and
up to 500 failures. The answer is a direct population reliability
reading of the untested population at a selected confidence
level. Six confidence levels from 50 to 90 percent are available.
(The statistical basis for this rule is the $\chi^2$ approximation of
the binomial distribution.)

Example 4a: Fifteen items are tested with one failure observed.
What is the population reliability at the 70-percent confidence level?

Solution 4a: Set one failure on the movable slide above the
70-percent confidence level index. On the total-number-of-tests scale,
find 15 tests and read a population reliability of 0.85 at the 70-percent
confidence level. Setting one failure at successive levels of
confidence gives these population reliabilities for this example:
0.710 at the 95-percent confidence level, 0.758 at 90 percent, 0.815
at 80 percent, 0.873 at 60 percent, and 0.895 at 50 percent.

Example 4b: A population reliability of 0.9 at 95-percent 
confidence level is desired. How many tests are required to 
demonstrate this condition? 

Solution 4b: Set zero failures at the 95-percent confidence 
level index. From total number of tests read 29 tests directly 
above 0.90 population reliability. Therefore, 29 tests without 
failure will demonstrate this combination. If, however, one 
failure occurs, set one failure at 95 percent. Then 46 others must 
pass the test successfully. Progressively more observed failures 
such as 10 (set of 10 at 95 percent) require 170 successes 
(160+10). 

Examples 5 (confidence level for safety margins): Safety
margin $S_M$ indicates the number of standard deviations $\sigma_M$
between some preselected reliability boundary $R_b$ and the mean
of the measured sample failure distribution. Thus,

$$S_M = \frac{\bar{x}_M - R_b}{\sigma_M}$$

where $\bar{x}_M$ and $\sigma_M$ are the measured mean and standard
deviation of the samples under test. The larger the sample size, the
more nearly the measured $S_M$ approaches the safety margin of the
untested population $S_P$. This rule relates $S_M$ to $S_P$ for six
levels of confidence for sample sizes N between 5 and 80. (Statistical
basis for this rule: noncentral t distribution.)

Example 5a: Ten items are tested to failure with an observed
or measured $S_M$ of 5.8. What is the lower expected safety
margin of the untested population at 90-percent confidence?

Solution 5a: Set 5.8 on the movable slide at the top window
for the $S_M$ value. Under N = 10 on the 90-percent window, read
$S_P \ge 3.9$. Without moving the slide, read for successive levels of
confidence 4.45 at 80 percent, 4.85 at 70 percent, 5.21 at
60 percent, and 5.57 at 50 percent.

Example 5b: Six samples are available for test. What $S_M$ is
required to demonstrate a population safety margin of 4.0 or
greater at the 90-percent confidence level?

Solution 5b: Using the 90-percent window, set $S_P = 4.0$
opposite N = 6. At $S_M$ read 7.1. Therefore, test results of 7.1 or
greater will demonstrate $S_P \ge 4.0$ at a 90-percent confidence
level. If 25 samples are available for test, set $S_P = 4.0$ opposite
N = 25 on the 90-percent window. An $S_M$ of only 5.0 or greater
would demonstrate a safety margin of 4.0 or greater at 90-percent
confidence.

Sneak circuits. — During attribute testing, the flight hard- 
ware may sometimes not work properly because of a sneak 
circuit. A sneak circuit is defined for both hardware and software 
as follows (ref. 6-10): 

(1) Hardware: a latent condition inherent to the system 
design and independent of component failure that inhib- 
its a desired function or initiates an undesired function 
(path, timing, indication, label) 

(2) Software: an unplanned event with no apparent cause- 
and-effect relationship that is not dependent on hard- 
ware failure and is not detected during a simulated 
system test (path, timing, indication, label) 

Each sneak circuit problem should be analyzed, a cause 
determined, and corrective action implemented and verified. 
References 6-10 to 6-12 give a number of examples of how this
can be done:

(1) Reluctant Redstone — making complex circuitry simple

(2) F-4 example 

(3) Trim motor example 

(4) Software example 

A few minutes spent with one of these references should solve 
any sneak circuit problem. 

Attribute test summary. — In summary, four concepts should 
be kept in mind: 

(1) An attribute test, when conducted with only a few 
samples, is not a satisfactory method of testing for reliability, 
but it can identify gross design and manufacturing problems. 

(2) An attribute test is an adequate method of testing for 
reliability only when sufficient samples are tested to establish 
an acceptable level of statistical confidence. 

(3) Some situations dictate attribute tests or no tests at all
(e.g., limited availability or the high cost of samples, limited 
time for testing, test levels that exceed the limits of test 
equipment, and the need to use the test samples after testing). 

(4) Confidence, a statistical term that depends on supporting 
statistical data, reflects the amount of risk we are willing to take 
when stating the reliability of a product. 

Test-To-Failure Methods 

The purpose of the test-to-failure method is to develop a
failure distribution for a product under one or more types of
stress. Here, testing continues until the unit under test ceases to
function within specified limits. Alternatively, test to failure
may be accomplished by increasing electrical load or mechanical
load until a failure is induced. The results are used to
calculate the probability of the failure of the device for each
load. In this case, the failures are usually tolerance or physical
wearout. The test-to-failure method is also valuable because we
can determine the “spread” or standard deviation of the loads
that cause failure (or the spread of the times to failure, etc.). This
spread has a significant effect on the overall reliability.

In this discussion of test-to-failure methods, the term safety
factor $S_F$ is included because it is often confused with safety
margin $S_M$. Safety factor is widely used in industry to describe
the assurance against failure that is built into structural products.
Safety factor $S_F$ can be defined as

$$S_F = \frac{\bar{x}_{avg}}{R_b}$$

where

$\bar{x}_{avg}$   mean strength of material

$R_b$   reliability boundary, the maximum anticipated operating
stress level the component receives

We choose to define “safety margin” by taking into account
the standard deviation or the spread of the data; hence, $S_M$ is the
number of standard deviations of the strength distribution that
lie between the reliability boundary $R_b$ and the mean strength:

$$S_M = \frac{\left| R_b - \bar{x}_{avg} \right|}{\sigma_s}$$

where $\sigma_s$ is the standard deviation of the strength distribution.

Using $S_F$ presents little risk when we deal with materials with
clearly defined, repeatable, and “tight” strength distributions,
such as sheet and structural steel or aluminum. However, when
we deal with plastics, fiberglass, and other metal substitutes or
processes with wide variations in strength or repeatability,
using $S_M$ provides a clearer picture of what is happening. In
most cases, we must know the safety margin to understand how
useful the safety factor is.

Figure 6-1.— Test-to-failure method applied to metallic structure.
Mean strength of material, $\bar{x}_{avg}$, 13; reliability boundary, $R_b$, 10;
standard deviation, $\sigma_s$, 0.75; safety factor, $S_F$, 13/10 or 1.3; safety
margin, $S_M$, (|10-13|)/0.75 or 4.0; probability of defect, 0.00003 or
0.003 percent.

Consider the example of the design of a support structure to 
hold cargo in a launch vehicle. The component strength is 
expressed and represented by its ability to withstand a particu- 
lar g force. Structural members (consisting of various mate- 
rials) are tested with a mechanical load until failure occurs. 

We may have materials with clearly defined, repeatable, and
tight strength distributions, such as sheet and structural steel or
aluminum. Here, using $S_F$ presents little risk (see fig. 6-1 for a
metallic structure where a normal (Gaussian) distribution is
assumed). Alternatively, we may have plastics, fiberglass, and
other metal substitutes or processes with wide variations in
strength or repeatability, and using $S_M$ provides a clearer picture
of a potential problem (see fig. 6-2 for a metal substitute, a
composite).

To use and benefit from this concept we need to 

(1) Know the material strengths and distributions 

(2) Identify the reliability boundary R b for the loading of the 
material 

(3) Know the safety margin to understand the usefulness of 
the safety factor 

Using safety margins in this way in the design process has a 
major benefit because they provide a clearer picture of what is 
happening in the real world by taking strength distributions into 
account. Also, the difference in the probability of defects (cal- 
culated by solving for the area under the normal distribution 
curve to the left of R b ) is better reflected in the difference in the 
strength margins. 

Figure 6-2.— Test-to-failure method applied to metal substitute
(composite). Mean strength of material, $\bar{x}_{avg}$, 13; reliability
boundary, $R_b$, 10; standard deviation, $\sigma_s$, 2.308; safety factor,
$S_F$, 13/10 or 1.3; safety margin, $S_M$, (|10-13|)/2.308 or 1.3;
probability of defect, 0.0968 or 9.68 percent.
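
The contrast between figures 6-1 and 6-2 can be reproduced with a few lines of Python. This is a sketch under the same normal-distribution assumption the figures use. The function names are illustrative, and `statistics.NormalDist` from the standard library stands in for the normal-curve tables.

```python
from statistics import NormalDist


def safety_factor(mean_strength, boundary):
    """Ratio of mean strength to reliability boundary."""
    return mean_strength / boundary


def safety_margin(mean_strength, boundary, sigma):
    """Number of standard deviations between boundary and mean strength."""
    return abs(boundary - mean_strength) / sigma


def defect_probability(mean_strength, boundary, sigma):
    """Area of the normal strength distribution to the left of the boundary."""
    return NormalDist(mean_strength, sigma).cdf(boundary)


# Same mean strength (13) and boundary (10); only the spread differs.
for name, sigma in (("metallic structure", 0.75), ("composite", 2.308)):
    print(name,
          round(safety_factor(13, 10), 2),
          round(safety_margin(13, 10, sigma), 2),
          defect_probability(13, 10, sigma))
```

Both materials report the same safety factor of 1.3, while the safety margin and the defect probability separate them clearly.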

In summary, test-to-failure methods can be used to develop 
a strength distribution that provides a good estimate of toler- 
ance and physical wearout problems without the need for the 
large samples required for attribute tests (note that extrapola- 
tion outside the range of data should be avoided). The results 
of a test-to-failure exposure of a device can be used to predict 
the reliability of similar devices that cannot or will not be tested. 
Testing to failure also provides the means for evaluating the 
failure modes and mechanisms of devices so that improve- 
ments can be made. It was also shown that a safety factor is 
much more useful if the associated safety margin is known. 

Test procedure and sample size.— Devices that are not
automatically destroyed upon being operated are normally not
expended or destroyed during a functional test. Electronic
equipment usually falls into this category. For such equipment,
a minimum sample size of five is necessary, each sample being
subjected to increasing stress levels until failure occurs or the
limits of the testing facility are reached. In the latter case, no
safety margin calculation is possible because no failures are
observed. Here, we must rely on intuition when deciding the
acceptability of the device.

Test-to-failure procedure and sample size requirements for
one-shot devices are different because a one-shot device is
normally expended or destroyed during a functional test. Ordnance
items such as squib switches fall into this category. For
such devices, at least 20 samples should be tested, but 30 to 70
would be more desirable. At least 12 failures should be observed
during a test. In a typical one-shot test, of which there are many
variations, a sample is tested at the reliability boundary and, if
it passes, a new sample is tested at predetermined stress
increments until a failure occurs. Then, the next sample is tested
at one stress increment below the last failure. If this sample
passes, the stress is increased one increment for the next sample.
This process, depicted in figure 6-3, continues until at least 12
failures have been observed.



Figure 6-3. — Example of one-shot test-to- 
failure procedure. 
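
The one-shot procedure described above can be sketched as a small simulation. Everything numerical here is hypothetical: the strength distribution (a mean of 14 g's and a standard deviation of 1.04 g's, borrowed from the squib switch example later in this section), the 0.5-g increment, and the seed are illustration values only. The sketch also assumes that the stress drops one increment after every failure, which the text states only for the sample that follows a failure.

```python
import random


def one_shot_staircase(next_strength, boundary, increment,
                       failures_needed=12, max_samples=1000):
    """Simulate the up-and-down one-shot test.

    next_strength: callable returning the strength of a fresh sample
    (hypothetical stand-in for firing a real device).
    Returns a list of (stress_level, passed) tuples, one per sample.
    """
    results = []
    level = boundary            # the first sample is tested at the boundary
    failures = 0
    while failures < failures_needed and len(results) < max_samples:
        passed = next_strength() > level
        results.append((level, passed))
        if passed:
            level += increment  # raise the stress for the next sample
        else:
            failures += 1
            level -= increment  # step back one increment after a failure
    return results


rng = random.Random(1)  # fixed seed so the illustration is repeatable
results = one_shot_staircase(lambda: rng.gauss(14.0, 1.04),
                             boundary=10.0, increment=0.5)
failure_levels = [level for level, passed in results if not passed]
print(len(results), "samples used;", len(failure_levels), "failures observed")
```

The recorded failure levels are the raw data from which a mean and standard deviation of the failure distribution would then be estimated.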


Safety margins for single failure modes.— For devices that
exhibit a single failure mode during a test-to-failure exposure,
the safety margin and the reliability are calculated by the technique
just discussed in the definition of safety margin. The following
examples further illustrate the method and show the
practical results.

Example 6: A test was conducted on a vendor's 0.25- and
0.50-W film resistors to evaluate their ability to operate reliably
at their rated power levels. Thirty samples of each type were
tested by increasing the power dissipation until the resistance
change exceeded 5 percent. The results are shown in
figure 6-4, from which the following points are noteworthy:

(1) The mean strength of the 0.25-W resistor was less than
half the mean strength of the 0.50-W resistor: $\bar{x}_{0.25} = 1.19$ W
compared with $\bar{x}_{0.50} = 2.6$ W. This was to be expected since
the 0.50-W resistor was larger, had more volume, and could
dissipate more energy.

(2) The standard deviation of the 0.25-W resistor was almost
the same as that for the 0.50-W resistor: $\sigma_{0.25} = 0.272$ W;
$\sigma_{0.50} = 0.332$ W. This was also expected because both resistors
were made by the same manufacturer and were subjected to the
same process controls and quality acceptance criteria.

(3) The 0.50-W resistor, because of its higher mean strength,
had a safety margin of 6.32 with reference to its rated power
dissipation of 0.50 W. According to table 5-5, this means that only
a fraction $0.0_{9}149$ (that is, $1.49 \times 10^{-10}$) of the resistors
would exceed a 5-percent resistance change
when applied at 0.50 W. The 0.25-W resistor, because of its
lower mean strength, had a safety margin of only 3.45 with
reference to its rated power of 0.25 W. According to table 5-5
again, this means that a fraction on the order of $10^{-4}$ of the
resistors would exceed a 5-percent
resistance change when applied at 0.25 W. Derating the 0.25-W
resistor to 0.125 W increased the safety margin to 3.92 and decreased
the expected fraction of failures to $0.0_{4}481$ (that is,
$4.81 \times 10^{-5}$), an improvement
factor of 7.5. This, of course, is the reason for derating components,
as discussed in chapter 4. Although we have indicated that
a safety margin of 6.32 has statistical meaning, in practice a
population safety margin of 5 or higher indicates that the applicable
failure mode will not occur unless, of course, the strength
distribution deviates greatly from a normal distribution.
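
The safety margins quoted in Example 6 follow directly from the measured means and standard deviations, and the effect of derating can be seen by changing only the applied power. The Python sketch below assumes a normal strength distribution and uses the standard-library normal CDF instead of table 5-5, so the failure fractions it prints will not match the table entries digit for digit. The names are illustrative.

```python
from statistics import NormalDist


def safety_margin(mean_strength, applied_power, sigma):
    """Standard deviations between the applied power and the mean strength."""
    return (mean_strength - applied_power) / sigma


def fraction_failing(margin):
    """Normal-distribution area below the applied power level."""
    return NormalDist().cdf(-margin)


# (mean strength W, applied power W, standard deviation W) from Example 6
cases = {
    "0.50-W resistor at 0.50 W": (2.6, 0.50, 0.332),
    "0.25-W resistor at 0.25 W": (1.19, 0.25, 0.272),
    "0.25-W resistor derated to 0.125 W": (1.19, 0.125, 0.272),
}

margins = {}
for name, (mean, power, sigma) in cases.items():
    margins[name] = safety_margin(mean, power, sigma)
    print(f"{name}: S_M = {margins[name]:.2f}, "
          f"fraction failing = {fraction_failing(margins[name]):.3g}")
```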

Example 7: A fiberglass material to be used for a flame shield
was required to have a flexural strength of 15 000 psi. The
results of testing 59 samples to failure are presented in figure 6-5.
The strength distribution of the material was calculated
to have a mean of 19 900 psi and a standard deviation of
4200 psi. The safety margin was then calculated as

$$S_M = \frac{\left| R_b - \bar{x}_s \right|}{\sigma_s} = \frac{\left| 15\,000 - 19\,900 \right|}{4200} = 1.17$$

Because, from table 5-7, $S_M = 1.17$ indicates that
87.9 percent of the samples will fail at reliability boundaries
above 15 000 psi, we can see that 12.1 percent will fail at
boundaries below 15 000 psi. This analysis is optimistic in that
11/59 = 18.6 percent actually did fail below 15 000 psi. The test
also shows that the reliability of the flame shield could be
improved either by selecting another type of material to obtain
a higher mean strength or by changing the fabrication processes
to reduce the large strength deviation.

Figure 6-5. — Strength distribution in fiberglass material.
$\bar{x}_s$ = 19 900 psi; $\sigma_s$ = 4200 psi.

Example 8: Samples of transistors from two vendors were
tested to failure under high temperatures. Failure was defined
as any out-of-tolerance parameter. The results shown in figure 6-6
indicate that vendor B's materials, design, and process
control were far superior to vendor A's as revealed by the large
differences in mean strength and standard deviation. With an
$S_M$ of 1.41, 7.9 percent of vendor A's transistors would fail at
the 74 °C reliability boundary; with an $S_M$ of 8.27, vendor B's
transistors would not be expected to fail at all. It is unlikely that
an attribute test would have identified the better transistor.

Example 9: Squib switch samples were tested to failure under
vibration in accordance with the procedure for testing one-shot
items. The results are shown in figure 6-7, where the mean and
standard deviations of the failure distribution have been calculated
from the failure points observed. As shown, $\bar{x}_s$ = 14 g's
and $\sigma_s$ = 1.04 g's to produce a safety margin of 3.84 with
reference to the reliability boundary of 10 g's.

The preceding examples have shown how the $P_t$ product
reliability term can be effectively demonstrated through
test-to-failure methods. This has been the case because each example
except the squib switch involved a tolerance problem. The
examples also show that the $K_m$ factor plays an important role
in product reliability and that control over K-factors can ensure
a significant increase in reliability.

Multiple failure modes.— Most products perform more than 
one function and have more than one critical parameter for each 
function. In addition, most products are made up of many types 
of materials and parts and require many fabrication processes 
during manufacture. It follows then that a product can exhibit 
a variety of failure modes during testing. 

Figure 6-6.— Test-to-failure results for two transistors. (a) Vendor A.
$\bar{x}_s$ = 105 °C; $\sigma_s$ = 22 °C. (b) Vendor B. $\bar{x}_s$ = 165 °C; $\sigma_s$ = 11 °C.

Figure 6-8.— Test-to-failure results when multiple failure modes
are observed.


In the conduct of a test to failure, each failure mode detected 
must be evaluated individually; that is, a failure distribution 
must be developed for each failure mode and safety margins 
must be calculated for each individual failure distribution. 
Moreover, as mentioned before, at least five samples or failure 
points are needed to describe each failure mode distribution. 

To see this more clearly, consider the test results shown in
figure 6-8. Here, each of the three failure modes observed is
described in terms of its own failure distribution and resulting
safety margin with reference to the same reliability boundary.
If these failure modes are independent and each represents an
out-of-tolerance $P_t$ condition, the $P_t$ of the test device is given
by

$$P_{t,\mathrm{total}} = P_{t1}(S_M = 3.5)\,P_{t2}(S_M = 2.1)\,P_{t3}(S_M = 7.6) = (0.9998)(0.9821)(1.00) = 0.9819$$

This also shows that the independent evaluation of each failure
mode identifies the priorities necessary to improve the product.
For example, the elimination of failure mode 2, either by
increasing $P_{t2}$ to 1 or by eliminating the mode altogether,
increases $P_{t,\mathrm{total}}$ from 0.9819 to 0.9998.
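
The product rule for independent failure modes is easy to verify numerically. The following Python sketch converts each safety margin to a reliability with the standard normal CDF (an assumption matching the normal strength distributions used in this chapter) and multiplies the results. The function names are illustrative.

```python
from math import prod
from statistics import NormalDist


def mode_reliability(margin):
    """Reliability for one failure mode with the given safety margin."""
    return NormalDist().cdf(margin)


def total_reliability(margins):
    """Product of per-mode reliabilities, valid for independent modes."""
    return prod(mode_reliability(m) for m in margins)


margins = [3.5, 2.1, 7.6]               # the three modes of figure 6-8
print(round(total_reliability(margins), 4))

# Removing failure mode 2 leaves only modes 1 and 3.
print(round(total_reliability([3.5, 7.6]), 4))
```

The mode with the smallest safety margin dominates the product, which is why it is the first candidate for corrective action.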



Figure 6-9. — Stress distribution for operating temperature
(horizontal axis: temperature, °F). $\bar{x}_s$ = 85 °F; $\sigma_s$ = 20 °F.


When stress distribution is known. — When safety margins
are calculated with reference to a single point or a fixed
reliability boundary, the resulting reliability estimate is conservative
because it is assumed that the equipment will always be
operated at the reliability boundary. As an illustration, figure 6-9
shows the stress distribution for the operating temperature
of a device and the maximum anticipated operating limit
(145 °F), which is given in the device specifications and would
normally be considered the reliability boundary.

Figure 6-10 shows the strength distribution of the device for
high temperatures and also that a safety margin for the device,
when referenced to the 145 °F reliability boundary, is 1.54, or a
reliability of 93.8 percent. We know, however, that the 145 °F
limit is the $3\sigma$ limit of the stress distribution and will occur only
0.135 percent of the time. The question is, How does this affect
the estimated reliability of the device in the temperature
environment?

If we select random values from the stress and strength dis- 
tribution and subtract the stress value from the strength value, 
a positive result indicates a success — the strength exceeds the 
stress. A negative result indicates a failure — the stress exceeds 
the strength. With this knowledge, we can calculate a difference 
distribution and through the application of the safety margin 
technique, solve for the probability of the strength being greater 
than the stress (i.e., success). This difference distribution is also 
distributed normally and has the following parameters: 

$$\bar{x}_{\mathrm{difference}} = \bar{x}_s - \bar{x}_{\mathrm{stress}}$$

$$\sigma_{\mathrm{difference}} = \left( \sigma_s^2 + \sigma_{\mathrm{stress}}^2 \right)^{1/2}$$

Figure 6-10. — Strength distribution for operating temperature.
$\bar{x}_s$ = 165 °F; $\sigma_s$ = 13 °F; $R_b$ = 145 °F; $S_M$ = 1.54.

Figure 6-11. — Strength and stress difference distribution.
$\bar{x}_s$ = 80 °F; $\sigma_s$ = 24 °F.

TABLE 6-3.—CONFIDENCE LEVEL TABLES FOR VARIOUS SAMPLE SIZES

Confidence level,   Confidence level table for sample size of
percent             5 to 12    13 to 20   21 to 29   30 to 100
99                  A-3(a)     A-3(b)     A-3(c)     A-3(d)
95                  A-4(a)     A-4(b)     A-4(c)     A-4(d)
90                  A-5(a)     A-5(b)     A-5(c)     A-5(d)

From the strength and stress distribution parameters given in
the preceding example (figs. 6-9 and 6-10),

$$\bar{x}_{\mathrm{difference}} = 165 - 85 = 80\ ^{\circ}\mathrm{F}$$

$$\sigma_{\mathrm{difference}} = \left( 20^2 + 13^2 \right)^{1/2} = 24\ ^{\circ}\mathrm{F}$$

This distribution is shown in figure 6-11.

Because positive numbers represent success events, we are 
interested in the area under the difference distribution that 
includes only positive numbers. This can be calculated by using 
zero as the reliability boundary and solving for the safety 
margin from 


$$S_M = \frac{\left| 0 - \bar{x}_s \right|}{\sigma_s} = \frac{\left| 0 - 80 \right|}{24} = 3.33$$


This 3.33 safety margin gives a reliability of 0.9996 when the 
stress distribution is considered. Comparing this result with the 
estimated reliability of 0.938 when the reliability boundary 
point estimate of 145 °F was used shows the significance of 
knowing the stress distribution when estimating reliability 
values. 
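
The comparison between the point-estimate reliability and the stress-strength result can be reproduced with the short Python sketch below. It assumes, as the text does, that both distributions are normal and independent. The function names are illustrative, and because the code does not round the combined standard deviation to 24 °F, the safety margin comes out near 3.35 instead of 3.33.

```python
from math import sqrt
from statistics import NormalDist


def point_estimate_reliability(mean_strength, sd_strength, boundary):
    """Reliability when the device is assumed to sit at the boundary."""
    return NormalDist().cdf((mean_strength - boundary) / sd_strength)


def interference_reliability(mean_strength, sd_strength, mean_stress, sd_stress):
    """Probability that strength exceeds stress for independent normals.

    Returns (safety margin of the difference distribution, reliability).
    """
    mean_diff = mean_strength - mean_stress
    sd_diff = sqrt(sd_strength ** 2 + sd_stress ** 2)
    margin = mean_diff / sd_diff    # boundary of the difference is zero
    return margin, NormalDist().cdf(margin)


# Values from figures 6-9 and 6-10 (degrees Fahrenheit).
print(round(point_estimate_reliability(165, 13, 145), 3))
margin, reliability = interference_reliability(165, 13, 85, 20)
print(round(margin, 2), round(reliability, 4))
```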

Confidence levels. — As discussed before, the main objective
in developing a failure distribution for a device by test-to-failure
methods is to predict how well a population of like devices
will perform. Of course, such failure distributions, along with
the resulting safety margins and reliability estimates, are subject
to error. Errors result from sample size limitations in much
the same way as the demonstrated reliability varies with sample
size in attribute testing. Specifically, the mean and the standard
deviations of the strength distribution must be adjusted to
reflect the sample size used in their calculation. For this
purpose, tables A-3 to A-5 in appendix A have been developed
by using the noncentral t distribution. Table 6-3 shows the
applicable appendix A tables for selected confidence levels and
sample sizes, and the examples that follow illustrate their use.

Example 10: Upon being tested to failure at high temperatures,
10 devices were found to have a failure distribution of
$\bar{x}_s$ = 112.7 °C and $\sigma_s$ = 16 °C. The reliability boundary was
50 °C. Find the safety margin and reliability demonstrated at
90-percent confidence.

Solution 10:

Step 1 — Solve first for the observed safety margin.

$$S_M = \frac{\left| R_b - \bar{x}_s \right|}{\sigma_s} = \frac{\left| 50 - 112.7 \right|}{16} = 3.92$$

From table 5-7, the observed reliability is 0.99996.

Step 2 — Now in appendix A refer to table A-5(a), which deals
with 90-percent confidence limits for safety margins, and
follow across to column N = 10, the number of samples. The
values under the N headings in all the tables listed in table 6-3
represent the observed safety margins for sample sizes as
calculated from raw test data. The $S_M$ column lists corresponding
population safety margins for the observed safety margins
shown under the N headings. Finally, corresponding population
reliability estimates are shown under the $P_x$ headings,
which may represent $P_t$ or $P_w$ as applicable.

Step 3 — Proceed down the N = 10 column to 3.923, the
observed safety margin derived in step 1.

Step 4 — Having located $S_M$ = 3.923 with 10 samples, follow
horizontally to the left to find the demonstrated population
safety margin in the $S_M$ column. This is 2.6.

Step 5 — With a population $S_M$ of 2.6, follow the same line to the
right to find the population reliability estimate under the $P_x$
heading. This value is 0.9953. Recall that the observed safety
margin was 3.923 and the observed reliability, 0.99996.

Example 11: Twelve gyroscopes were tested to failure by
using time as a stress to develop a wearout distribution. The
wearout distribution was found to have an $\bar{x}_s$ of 5000 hours and
a $\sigma_s$ of 840 hours. Find the $P_w$ demonstrated at 95-percent
confidence with a reliability boundary of 1000 hours.

Solution 11:

Step 1 — The sample safety margin is

$$S_M = \frac{\left| 1000 - 5000 \right|}{840} = 4.76$$

Step 2 — The population safety margin at 95-percent confidence
with a 12-sample safety margin of 4.76 is read directly
from table A-4(a) to be 3.0.

Step 3 — For a population $S_M$ of 3.0, the corresponding $P_w$ under
the $P_x$ column is 0.9986. Therefore, 99.86 percent of the
gyroscopes will not wear out before 1000 hours have been
accumulated.

Safety factor. — This section is included in the discussion of
test-to-failure methods because the term “safety factor” is often
confused with safety margin. It is used widely in industry to
describe the assurance against failure that is built into structural
products. There are many definitions of safety factor $S_F$, with
the most common being the ratio of mean strength to reliability
boundary:

$$S_F = \frac{\bar{x}_s}{R_b}$$

When dealing with materials with clearly defined, repeatable,
and “tight” strength distributions, such as sheet and structural
steel or aluminum, using $S_F$ presents little risk. However, when
dealing with plastics, fiberglass, and other metal substitutes or
processes with wide variations in strength or repeatability, using
$S_M$ provides a clearer picture of what is happening (fig. 6-12).
In most cases, we must know the safety margin to understand
how accurate the safety factor may be.

Test-to-failure summary. — In summary, you should understand
the following concepts about test-to-failure applications:

(1) Developing a strength distribution through test-to-failure
methods provides a good estimate of the $P_t$ and $P_w$ product
reliability terms without the need for the large samples required
for attribute tests.

Figure 6-12. — Two structures with identical safety factors
($S_F$ = 13/10 = 1.3) but with different safety margins.
(a) Structure A. (b) Structure B.

(2) The results of a test-to-failure exposure of a device can be
used to predict the reliability of similar devices that cannot or
will not be tested.

(3) Testing to failure provides a means of evaluating the 
failure modes and mechanisms of devices for improvement 
purposes. 

(4) Testing to failure allows confidence levels to be applied 
to the safety margins and to the resulting population reliability 
estimates. 

(5) To know how accurate a safety factor may be, we must
also know the associated safety margin. 


Life Test Methods 

Life tests are conducted to illustrate how the failure rate of a
typical system or complex subsystem varies during its operating
life. Such data provide valuable guidelines for controlling
product reliability. They help to establish burn-in requirements,
to predict spare part requirements, and to understand the
need for or lack of need for a system maintenance program.
Such data are obtained through laboratory life tests or from the
normal operation of a fielded system.

Life tests are performed to evaluate product failure-rate
characteristics. If failures include all causes of system failure,
the failure rate of the system is the only true factor available for
evaluating the system's performance. Life tests at the parts
level often require large sample sizes if realistic failure-rate
characteristics are to be identified and laboratory life tests are
to simulate the major factors that influence failure rates in a
device during field operations. Furthermore, the use of running
averages in the analysis of life data will identify burn-in and
wearout regions if such exist. Failure rates are statistics and
therefore are subject to confidence levels when used in making
predictions (see refs. 6-13 to 6-17).

Figure 6-13 illustrates what might be called a failure surface
for a typical product. It shows system failure rate versus 
operating time and environmental stress. These three param- 
eters describe a surface such that, given an environmental stress 
and an operating time, the failure rate is a point on the surface. 



Test-to-failure methods generate lines on the surface parallel 
to the stress axis; life tests generate lines on the surface parallel 
to the time axis. Therefore, these tests provide a good descrip- 
tion of the failure surface and, consequently, the reliability of 
a product. 

Attribute tests result only in a point on the surface if failures 
occur and a point somewhere on the x,y-plane if failures do not 
occur. For this reason, attribute testing is one of the least 
desirable methods for ascertaining reliability. Of course, in the 
case of missile flights or other events that produce go/no-go 
results, an attribute analysis is the only way to determine 
product reliability. 

Application . — Although life test data are derived basically 
for use in evaluating the failure characteristics of a product, 
byproducts of the evaluation may serve many other purposes. 
Four of the most frequent are 

(1) To serve as acceptance criteria for new hardware. For 
example, a product may be subjected to a life test before it is 
accepted for delivery to demonstrate that its failure rate is below 
some predetermined value. Examples of such applications are 
burn-in or debugging tests and group B life tests conducted on 
electronic parts. Some manufacturers of communications satellites 
subject all electronic parts to a 1200-hour burn-in test 
and use only the ones that survive. 

(2) To identify product improvement methods. Here, life 
tests serve a dual purpose by providing hardware at essentially 
no cost for physics-of-failure analyses. In turn, these analyses 
identify failure mechanisms and the action needed to effectively 
reduce a product's failure rate. In the past 10 years, this has 
resulted in significant part failure-rate reductions. In fact, the 
failure rates of some components have been reduced so far that 
accelerated life tests (life tests at elevated stress levels) and 
test-to-failure techniques must be employed to attain reliability 
improvements in a reasonable timeframe. 

(3) To establish preventive maintenance policies. Products 
with known or suspected wear mechanisms are life tested to 
determine when the wearout process will begin to cause 
undesirable failure-rate trends. Once the wearout region is 
established for a product, system failures can be reduced by 
implementing a suitable preventive maintenance plan or over- 
haul program. This is effectively illustrated in figure 6-14, 
which shows the failure-rate trend in a commercial jet aircraft 
subsystem. Here, the upward trend after 4000 hours of operation 
was revealed to be caused by a servomechanism that 
required lubrication. By establishing a periodic lubrication 
schedule for the mechanism, further failures were eliminated. 
Note that this subsystem also exhibited burn-in and intrinsic-
failure-rate regions. 

Figure 6-14. Failure-rate characteristics of commercial jet electronic 
subsystem. 

(4) To assess reliability. Here, tests are performed or life data 
are collected from fielded systems to establish whether contrac- 
tual reliability requirements are actually being met. In cases of 
noncompliance and when the field failures are analyzed, one of 
the preceding methods is employed to improve the product, or 
else a design change is implemented. The effectiveness of the 
corrective action is then evaluated from additional life data. 
Because life-test-observed failure rates include catastrophic, 
tolerance, wearout, and K-factor failures, life tests usually 
demonstrate product reliability. 

Test procedure and sample size . — Conducting a life test is 
fairly straightforward. It involves only the accumulation of 
equipment operating time. Precautions must be taken, how- 
ever, when the test is conducted in a laboratory. Operating con- 
ditions must include all the factors that affect failure rates when 
the device is operated tactically. Major factors are environ- 
ment, power-on and power-off times, power cycling rates, 
preventive maintenance, operator tasks, and field tolerance 
limits. Ignoring any of these factors may lead to an unrealistic 
failure-rate estimate. 

When accelerated life tests are conducted for screening 
purposes, stress levels no greater than the inherent strength of 
the product must be chosen. The inherent strength limit can be 
evaluated through test-to-failure methods before the life tests 
are conducted. 

Experience with nonaccelerated life tests of military stan- 
dard electronic parts for periods as long as 5000 hours indicates 
that an average of one to two failures per 1000 parts can be 
expected. For this reason, life tests will not provide good 
reliability estimates at the part level except when quantities on 
the order of 1000 or more parts are available. On the other hand, 
life tests are efficient at the system level with only one sample 
as long as the system is fairly complex (includes several thousand 
parts). 

Life tests intended to reveal the wearout characteristics of a 
device may involve as few as five samples, although from 20 to 
30 are more desirable if a good estimate of the wearout 
distribution is to be obtained. 

Analyzing life test data .— Recall from chapter 3 that an 
empirical definition of mean time between failures (MTBF) 
was given as 

$$\mathrm{MTBF} = \frac{\text{Total test hours}}{\text{Total observed failures}}$$

Remember also that because this expression neglects to show 
when the failures occur, it assumes an intrinsic failure rate and 
therefore an intrinsic mean time between failures, or MTBF. 


The assumption of an intrinsic failure rate may not be valid in 
some cases, but life test results have traditionally been reported 
this way. 

To see this illustrated, consider the results of a 4000-hour life 
test of a complex (47 000 parts) electronic system as shown in 
figure 6-15. This graph plots the cumulative number of failures 
against the times at which the 47 failures were observed, so that 
the slopes of the lines represent the failure rate. The solid line 
shows the system failure rate that resulted from assuming an 
intrinsic failure rate, which was 

$$\lambda = \frac{\text{Total failures}}{\text{Total operation time}} = \frac{47}{4000} = 1\ \text{failure}/86\ \text{hours}$$

From the plotted test data, it is obvious that this intrinsic failure 
rate was not a good estimate of what really happened. The plotted 
data indicate that there were two intrinsic-failure-rate portions: 
one from 0 to 1000 hours and the other from 1000 to 4000 hours. 
In the 0- to 1000-hour region, the actual failure rate was 

$$\lambda = \frac{35}{1000} = 1\ \text{failure}/29\ \text{hours}$$

or about 3 times higher than the total average failure rate of 
1/86 hours; in the 1000- to 4000-hour region, the actual failure 
rate was 



$$\lambda = \frac{12}{3000} = 1\ \text{failure}/250\ \text{hours}$$

or about 2.9 times lower than the average. 

Figure 6-15. Results of complex electronic system life test. 

This illustration establishes the desirability of knowing when 
failures occur, not just the number of failures. The results of 
analyzing data by regions can be used to evaluate burn-in and 
spare parts requirements. The burn-in region was identified to 
be from 0 to 1000 hours because after this time the failure rate 
decreased by a factor of 8.6. 
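
The regional analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-region failure counts (35 and 12) are inferred from the rates quoted in the text rather than read from figure 6-15, and the function name `regional_failure_rates` is hypothetical.

```python
def regional_failure_rates(boundaries, failure_counts):
    """Return (start, end, failures_per_hour, hours_per_failure) per region.

    boundaries: region edges in operating hours, e.g. [0, 1000, 4000]
    failure_counts: failures observed in each region (assumed values here)
    """
    regions = []
    edges = zip(boundaries, boundaries[1:])
    for (start, end), count in zip(edges, failure_counts):
        hours = end - start
        regions.append((start, end, count / hours, hours / count))
    return regions


# Counts inferred from the quoted rates: about 35 failures in the first
# 1000 hours and 12 in the following 3000 hours (47 in total).
regions = regional_failure_rates([0, 1000, 4000], [35, 12])
burn_in, steady = regions
improvement = burn_in[2] / steady[2]  # ratio of the two regional rates

for start, end, rate, hours_per_failure in regions:
    print(f"{start}-{end} hr: 1 failure per {hours_per_failure:.0f} hours")
print(f"failure rate drops by a factor of about {improvement:.1f}")
```

With these assumed counts the ratio comes out slightly above the 8.6 quoted in the text, which follows from the rounded figures 250/29.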

This result also has a significant effect on logistics. For 
example, if we assume that the system will accumulate 
1000 hours per year, we can expect during the first year to 
replace 35 parts: 


$$\frac{1\ \text{failure}}{29\ \text{hours}} \times 1000\ \text{hours}$$


whereas during the next and subsequent years we can expect to 
make only four replacements: 


$$\frac{1\ \text{failure}}{250\ \text{hours}} \times 1000\ \text{hours}$$


Using the average failure rate of 1 failure/86 hours, we would 
have to plan, however, for about 12 replacements every year. 
Obviously, the cost impact of detailed analysis can be substantial. 
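
The spare parts arithmetic can be checked with a short Python sketch. The helper name `expected_replacements` is hypothetical, and rounding up to whole parts is an assumption about how a planner would use the figures.

```python
import math


def expected_replacements(hours_per_failure, operating_hours):
    """Expected failures over operating_hours, rounded up to whole parts."""
    return math.ceil(operating_hours / hours_per_failure)


HOURS_PER_YEAR = 1000  # yearly usage assumed in the text

first_year = expected_replacements(29, HOURS_PER_YEAR)    # burn-in region
later_years = expected_replacements(250, HOURS_PER_YEAR)  # after burn-in
flat_average = expected_replacements(86, HOURS_PER_YEAR)  # single average rate

print(first_year, later_years, flat_average)
```

Planning from the single average rate would understock the first year and overstock every later year, which is the logistics point the example makes.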

Running averages . — When system failure rates are irregular 
or when there is a need to evaluate the effect of different 
operating conditions on a system, running average analyses are 
useful. This can best be illustrated through the example pre- 
sented in figure 6-16. A 300-hour running average in 50-hour 
exposures is shown for a complex system during an engineer- 
ing evaluation test. (Running averages are constructed by 
finding the failure rate for the first 300 hours of operation, then 
dropping the first 50 hours and picking up the 300- to 350-hour 
interval and calculating the new 300-hour regional failure rate, 
and then repeating the process by dropping the second 50 hours 
of data and adding the next 50 hours for the total test period.) 
From the resultant curve, you can readily see (1) the effects of 
the debugging test, (2) the increase in failure rate during the 
high-temperature test and the decrease after that test, (3) 
another increase during low-temperature exposure and the sub- 
sequent decrease, (4) a slight increase caused by vibration, and 
(5) a continuously decreasing rate as the test progressed. The 
curve indicates that the system is the most sensitive to high 
temperature and that because the failure rate continued to 
decrease after high-temperature exposure, exposure to high 
temperatures is an effective way to screen defective parts from 
the system. Because the failure rate continued to decrease after 
the tests were completed, neither low temperature nor vibration 
caused permanent damage to the system. 
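
The running average construction described above can be sketched as follows. This is a minimal illustration, not the procedure used to produce figure 6-16: the function name and the failure times are hypothetical, and a failure exactly on a window boundary is assumed to belong to the window that ends there.

```python
def running_failure_rates(failure_times, window, step, total_time):
    """Return (start, end, failures_per_hour) for each sliding window.

    failure_times: operating hours at which failures were observed
    window: length of the averaging window in hours (e.g. 300)
    step: how far the window advances each time in hours (e.g. 50)
    """
    rates = []
    start = 0
    while start + window <= total_time:
        end = start + window
        count = sum(1 for t in failure_times if start < t <= end)
        rates.append((start, end, count / window))
        start += step
    return rates


# Hypothetical failure log (hours) for a 400-hour test.
failures = [10, 40, 90, 120, 260, 310, 340]
rates = running_failure_rates(failures, window=300, step=50, total_time=400)
for start, end, rate in rates:
    print(f"{start:4d}-{end:4d} hr: {1000 * rate:.1f} failures per 1000 hours")
```

Plotting the third element of each tuple against the window end time gives a curve of the kind shown in figure 6-16.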



Operating time, hr 

Figure 6-16. — Running average failure-rate analysis of life test 
data (300-hr running average in 50-hr increments). 


At the end of the 3000-hour period, the failure rate was 
3.3 failures per 1000 hours. This reflected a tenfold decrease 
from the initial failure rate during debugging, typical of the 
results observed for many complex systems. An example of a 
running average failure-rate analysis that identifies a system 
wearout region is shown in figure 6-17. The increasing failure 
rate after 3000 hours was caused by relay failures (during 
approximately 10 000 cycles of operation). This type of infor- 
mation can be used to establish a relay replacement requirement 
as part of a system preventive maintenance plan. 

Confidence levels . — As discussed in chapter 4, failure rates 
are statistical. Consequently, they are subject to confidence 
levels just as attribute and test-to-failure results are influenced 
by such factors. Confidence levels for intrinsic failure rates are 
calculated by using table A-2 in appendix A. 

To use this table, first calculate the total test hours accumulated 
from 

$$t = \sum_{i=1}^{n} N_i T_i$$

where 

$N_i$   $i$th unit tested 
$T_i$   test time of $N_i$ 
$n$     total units tested 

Then find under the number of failures observed during the test 
the tolerance factor for the desired confidence level. The lower 
limit for the MTBF at the selected confidence level is then 
found from 

$$\mathrm{MTBF} = \frac{t}{\text{Tolerance factor}}$$

and the upper limit for failure rate from 

$$\lambda = \frac{\text{Tolerance factor}}{t}$$

Figure 6-17. Running average failure-rate analysis of life test 
data identifying wearout region (600-hr running average in 
200-hr increments). 

Example 13: A system was life tested for 3000 hours, during 
which six failures were observed. What is the demonstrated 
80-percent-confidence MTBF? 

Solution 13: 

Step 1: Solve for the total test hours. 

$$t = \sum_{i=1}^{n} N_i T_i = 1 \times 3000 = 3000\ \text{hours}$$

Step 2: From table A-2 find the tolerance factor for six 
failures at 80-percent confidence to be 9.0. 

Step 3: Solve for the demonstrated MTBF. 

$$\mathrm{MTBF} = \frac{t}{\text{Tolerance factor}} = \frac{3000}{9} = 333\ \text{hours}$$

in contrast to the observed MTBF of 3000/6 = 500 hours. 

Example 14: Had four of the six failures in example 13 been 
observed in the first 1000 hours, what would be the demonstrated 
MTBF at 80-percent confidence in the region from 1000 
to 3000 hours? 

Solution 14: 

Step 1: The total test time is given as t = 2000 hours. 

Step 2: From table A-2 find the tolerance factor for two 
failures at 80-percent confidence to be 4.3. 

Step 3: Find the demonstrated MTBF at 80-percent confidence 
after 1000 to 3000 hours. 

$$\mathrm{MTBF} = \frac{2000}{4.3} = 465\ \text{hours}$$

Example 15: It is desired to demonstrate an 80-hour MTBF 
on a computer at 90-percent confidence. How much test time 
is required on one sample if no failures occur? 

Solution 15: 

Step 1: From table A-2 find the tolerance factor for no 
failures at 90-percent confidence to be 2.3. 

Step 2: Because the desired 90-percent-confidence MTBF 
is given as 80 hours and the tolerance factor is known, calculate 
the total test time required from 

$$t = (\mathrm{MTBF})(\text{Tolerance factor}) = (80)(2.3) = 184\ \text{hours}$$

to prove that 184 hours with no failures demonstrates an 
80-hour MTBF at 90-percent confidence. 

A good discussion of fixed time and sequential tests is given 
in MIL-STD-781D (ref. 6-3). 
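
The confidence-limit calculations in examples 13 to 15 can be approximated without the printed table. Table A-2 is not reproduced here, so the sketch below assumes its tolerance factors follow the usual Poisson bound for a time-terminated test (the value m at which the probability of observing r or fewer failures equals 1 minus the confidence level). That assumption gives about 2.3, 4.3, and 9.1 for the three cases, close to the 2.3, 4.3, and 9.0 quoted from the table. All function names are hypothetical.

```python
import math


def poisson_cdf(r, m):
    """Probability of r or fewer events for a Poisson mean m."""
    term = math.exp(-m)
    total = term
    for k in range(1, r + 1):
        term *= m / k
        total += term
    return total


def tolerance_factor(failures, confidence):
    """Assumed form of the table A-2 factor, found by bisection on m."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(failures, mid) > 1 - confidence:
            lo = mid  # m is still too small
        else:
            hi = mid
    return (lo + hi) / 2


def demonstrated_mtbf(total_hours, failures, confidence):
    """Lower confidence limit on MTBF: t / tolerance factor."""
    return total_hours / tolerance_factor(failures, confidence)


def required_test_hours(target_mtbf, failures, confidence):
    """Test time needed to demonstrate target_mtbf with the given failures."""
    return target_mtbf * tolerance_factor(failures, confidence)


print(demonstrated_mtbf(3000, 6, 0.80))   # example 13, close to 333 hours
print(demonstrated_mtbf(2000, 2, 0.80))   # example 14, close to 465 hours
print(required_test_hours(80, 0, 0.90))   # example 15, close to 184 hours
```

Small differences from the worked examples come from the rounding of the tabulated factors.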

Life test summary.— In summary, the following concepts 

are reiterated: 

(1) Life tests are performed to evaluate product failure-rate 
characteristics. 

(2) If “failures” include all causes of system failure, the 
failure rate of the system is the only true factor available for 
evaluating the system’s performance. 

(3) Life tests at the part level require large sample sizes if 
realistic failure-rate characteristics are to be identified. 

(4) Laboratory life tests must simulate the major factors that 
influence failure rates in a device during field operations. 

(5) The use of running averages in the analysis of life data 
will identify burn-in and wearout regions if such exist. 

(6) Failure rates are statistics and therefore are subject to 
confidence levels when used in making predictions. 

Conclusion 

When a product fails, whether during a test or from service, 
a valuable piece of information about it has been generated. We 
have the opportunity to learn how to improve the product if we 
take the right actions. 

Much can be learned from each failure by using good failure 
reporting, analysis, and a concurrence system and by taking 
corrective action. Failure analysis determines what caused the 
part to fail. Corrective action ensures that the cause is dealt with. 

With respect to testing, experimentation and evaluation to 
determine failure modes and effects greatly benefit reliability 
analysis. They do so by giving precise answers to the questions 
of why and how a product or component fails. Testing helps to 
reduce high development risks associated with a completely 
new design, to analyze high-risk portions of the design, and to 
confirm analytical models. 

Attribute tests, although not the most satisfactory method of 
testing, can still identify gross design and manufacturing 
problems. Test-to-failure methods can be used to develop a 
strength distribution that gives a good estimate of tolerance and 
physical wearout problems without the need for large samples 
required in attribute tests. Life tests are performed to evaluate 
product failure-rate characteristics but the tests must be 
carefully designed. 

All these test methods can be used to establish system-level 
reliability and, when conducted properly and in a timely fashion, 
can give valuable information about product behavior 
and overall reliability. 

Reliability Training 1 

1 . Seven hydraulic power supplies were tested in a combined high-temperature and vibration test. Outputs of six of the seven 
units tested were within limits. 

a. What is the observed reliability R of the seven units tested? 

A. 0.825 B. 0.857 C. 0.913 

b. What is the predicted population reliability R at 80-percent confidence? 

A. 0.50 B. 0.75 C. 0.625 

c. How many tests (with one failure already experienced) are needed to demonstrate R = 0.88 at 80-percent confidence? 

A. 24 B. 15 C. 30 

2. A vibration test was conducted on 20 autopilot sensing circuits with these results: mean $\bar{x}_s$ = 7.8 g's, standard deviation 
$\sigma_s$ = 1.2 g's; reliability boundary $R_b$ = 6 g's. 

a. What is the observed safety margin $S_M$? 

A. 2.0 B. 1.0 C. 1.5 

b. What is the observed reliability $R$? 

A. 0.900 B. 0.935 C. 0.962 

c. What is the predicted population safety margin $S_M$ at 80-percent confidence? 

A. 1.19 B. 2.19 C. 3.19 

d. What is the predicted population reliability R at 80-percent confidence? 

A. 0.75 B. 0.95 C. 0.88 

e. How could the autopilot be made more reliable? 

A. Add brackets, thicker mounting materials, stiffer construction. 

B. Control material tolerances more tightly; inspect torque values and weld assemblies. 

C. Use vibration isolators. 

D. All of the above. 

3. Twenty-five low-pressure hydraulic line samples were tested to destruction. These lines are rated to carry 30 psia ($R_b$). 
$\bar{x}_s$ = 31.5 psia; $\sigma_s$ = 0.75 psia. 

a. What is the observed $S_M$ of these test items? 

A. 1.0 B. 2.0 C. 3.0 

b. What is the predicted population safety margin $S_M$ at 90-percent confidence? 

A. 0.95 B. 1.25 C. 1.51 

c. The design requirement calls for an $S_M$ > 4.0 at 90-percent confidence. After discussing the problem with the designer, it 
was learned that the 30-psia rating included a 2.5-psia "pad." Using the corrected $R_b$ of 27.5 psia, now what are the $S_M$ and 
$S_D$ at 90-percent confidence? 

i. $S_M$ (observed) = ? 

A. 4.22 B. 5.33 C. 6.44 

ii. $S_D$ (predicted) = ? 

A. 4.28 B. 3.75 C. 4.80 

1 Answers are given at the end of this manual. Please assemble and use the slide rule at the back of this manual to do this problem set. 

Chapter 7 

Software Reliability 


Software reliability management is highly dependent on how 
the relationship between quality and reliability is perceived. For 
the purposes of this manual, quality is closely related to the 
process, and reliability is closely related to the product. Thus, 
both span the life cycle. 

Before we can stratify software reliability, the progress of 
hardware reliability should be briefly reviewed. Over the past 
25 years, the industry has observed (1) the initial assignment of 
“wizard status” to hardware reliability for theory, modeling, 
and analysis, (2) the growth of the field, and (3) the final 
establishment of hardware reliability as a science. One of the 
major problems was aligning reliability predictions and field 
performance. Once that was accomplished, the wizard status 
was removed from hardware reliability. The emphasis in hard- 
ware reliability from now to the year 2000, as discussed in 
chapter 1, will be on system failure modes and effects. 

Software reliability has reached classification as a science for 
many reasons. The difficulty in assessing software reliability is 
analogous to the problem of assessing the reliability of a new 
hardware device with unknown reliability characteristics. The 
existence of 30 to 50 different software reliability models 
indicates the organization in this area. As discussed in chapter 
1, hardware reliability started at a few companies and later was 
the focus of the AGREE reports. The field then logically 
progressed through different models in sequence over the years. 
Along the same lines, numerous people and companies have 
simultaneously entered the software reliability field in their 
major areas: namely, cost, complexity, and reliability. The 
difference is that at least 100 times as many people are now 
studying software reliability as initially studied hardware 
reliability. The existence of so many models and their claims 
tends to mask the fact that several of these models have shown 
excellent correlations between software performance predic- 
tions and actual software field performance; for instance, the 
Musa model as applied to communications systems and the 
Xerox model as applied to office copiers. There are also reasons 
for not accepting software reliability as a science, and they are 
briefly discussed here. 


One impediment to the establishment of software reliability 
as a science is the tendency toward programming development 
philosophies such as (1) “do it right the first time” (a reliability 
model is not needed), or (2) “quality is a programmer’s devel- 
opment tool,” or (3) “quality is the same as reliability and is 
measured by the number of defects in a program and not by its 
reliability.” All these philosophies tend to eliminate probabilis- 
tic measures because the managers consider a programmer to 
be a software factory whose quality output is controllable, 
adjustable, or both. In actuality, hardware design can be con- 
trolled for reliability characteristics better than software design 
can. Design philosophy experiments that failed to enhance 
hardware reliability are again being formulated for software 
design. (Some of the material in this chapter is reprinted with 
permission from ref. 7-1.) Quality and reliability are not the 
same. Quality is characteristic and reliability is probabilistic. 
Our approach draws the line between quality and reliability 
because quality is concerned with the development process and 
reliability is concerned with the operating product. Many models 
have been developed and a number of the measurement models 
show great promise. Predictive models have been far less 
successful partly because a data base (such as MIL-HDBK-217E 
(ref. 7-2) for hardware) is not yet available for software. 
Software reliability often has to use other methods; it must be 
concerned with the process of software product development. 


Models 

The development of techniques for measuring software reli- 
ability has been motivated mainly by project managers who not 
only need ways of estimating the manpower required to de- 
velop a software system with a given level of performance but 
also need techniques for determining when this level of perfor- 
mance has been reached. Most software reliability models 
presented to date are still far from satisfying these two needs. 

Most models assume that the software failure rate will be 
proportional to the number of implementation and design errors 
in the system without taking into account that different kinds of 
errors may contribute differently to the total failure rate. Elimi- 
nating one significant design error may double the mean time 
to failure, whereas eliminating 10 minor implementation errors 
(bugs) may have no noticeable effect. Even assuming that the 
failure rate is proportional to the number of bugs and design 
errors in the system, no model considers that the failure rate will 
then be related to the system workload. For example, doubling 
the workload without changing the distribution of input data to 
the system may double the failure rate. 

Software reliability models can be grouped into four catego- 
ries: time domain, data domain, axiomatic, and other. 


Time Domain Models 

Models formulated in the time domain attempt to relate 
software reliability (characterized, for instance, by a 
mean-time-to-failure (MTTF) figure under typical workload conditions) 
to the number of bugs present in the software at a given 
time during its development. Typical of this approach are the 
models presented by Shooman (ref. 7-3), Musa (ref. 7-4), and 
Jelinski and Moranda (ref. 7-5). Removing implementation 
errors should increase MTTF, and correlating bug removal 
history with the time evolution of the MTTF value may allow 
prediction of when a given MTTF will be reached. 
The main disadvantages of time domain models are that bug 
correction can generate more bugs and that software unreliability 
can be due not only to implementation errors but also to design 
(specification) errors and to errors in the characterization and 
simulation of the typical workload during testing. 

The Shooman model (ref. 7-3) attempts to estimate the 
software reliability (that is, the probability that no software 
failure will occur during an operating time interval $(0, t)$) from 
an estimate of the number of errors per machine-language 
instruction present in a software system after $T$ months of 
debugging. The model assumes that at system integration there 
are $E_i$ errors present in the system and that the system is 
operated continuously by an exerciser that emulates its real use. 
The hazard function after $T$ months of debugging is assumed to 
be proportional to the remaining errors in the system. The 
reliability of the software system is then assumed to be 

$$R(t) = e^{-C E_r(T)\, t}$$

where $E_r(T)$ is the remaining number of errors in the system 
after $T$ months of debugging and $C$ is a proportionality constant. 
The model provides equations for estimating $C$ and $E_r(T)$ from 
the results of the exerciser and the number of errors corrected. 
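
As a rough numerical illustration of the Shooman form, the sketch below assumes the reliability expression is R(t) = exp(-C E_r(T) t), so that the hazard is the constant C E_r(T) and the mean time to failure is its reciprocal. The values of C and of the remaining error count are hypothetical, as are the function names.

```python
import math


def shooman_reliability(t, remaining_errors, c):
    """R(t) = exp(-C * E_r(T) * t) for an operating interval of length t."""
    return math.exp(-c * remaining_errors * t)


def shooman_mttf(remaining_errors, c):
    """Mean time to failure for the constant hazard C * E_r(T)."""
    return 1.0 / (c * remaining_errors)


C = 1e-4  # hypothetical proportionality constant, per error per hour

# Hypothetical remaining error counts at two stages of debugging.
for remaining in (50, 10):
    r10 = shooman_reliability(10, remaining, C)
    mttf = shooman_mttf(remaining, C)
    print(f"{remaining} errors left: R(10 h) = {r10:.4f}, MTTF = {mttf:.0f} h")
```

Under these assumptions, removing errors raises both the reliability over a fixed interval and the MTTF in direct proportion to the errors removed.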

The Jelinski-Moranda model (ref. 7-5) is a special case of 
the Shooman model. The additional assumption made is that 
each error discovered is immediately removed, decreasing the 
remaining number of errors by one. Assuming that the amount 
of debugging time between error occurrences has an exponential 
distribution, the density function of the time of discovery of 
the $i$th error, measured from the time of discovery of the $(i-1)$th 
error, is 

$$p(t_i) = \lambda(i)\, e^{-\lambda(i)\, t_i}$$

where $\lambda(i) = \phi(N - i + 1)$ and $N$ is the number of errors originally 
present. The model gives the maximum likelihood estimates 
for $N$ and $\phi$. 
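
A minimal sketch of the Jelinski-Moranda estimates follows, assuming exponential times between discoveries with rate proportional to the number of errors still present. For a fixed N the maximum likelihood value of the proportionality constant has a closed form; N itself is found here by a simple scan over integers, which is a simplification chosen for illustration rather than the estimator given in ref. 7-5. The interval data and all names are hypothetical.

```python
import math


def jm_phi_hat(intervals, n_total):
    """Maximum likelihood phi for a fixed number of initial errors N."""
    weighted = sum((n_total - i + 1) * x
                   for i, x in enumerate(intervals, start=1))
    return len(intervals) / weighted


def jm_log_likelihood(intervals, n_total):
    """Log-likelihood at the best phi for this N."""
    phi = jm_phi_hat(intervals, n_total)
    total = 0.0
    for i, x in enumerate(intervals, start=1):
        lam = phi * (n_total - i + 1)    # hazard before the i-th discovery
        total += math.log(lam) - lam * x  # log of lam * exp(-lam * x)
    return total


def jm_fit(intervals, max_total=500):
    """Scan integer N >= number of observed errors; return (N_hat, phi_hat)."""
    n_observed = len(intervals)
    n_hat = max(range(n_observed, max_total + 1),
                key=lambda n_total: jm_log_likelihood(intervals, n_total))
    return n_hat, jm_phi_hat(intervals, n_hat)


# Hypothetical debugging times (hours) between successive error discoveries.
intervals = [5, 8, 12, 20, 35, 60]
n_hat, phi_hat = jm_fit(intervals)
print(f"estimated initial errors: {n_hat}, phi: {phi_hat:.4f}")
```

When the intervals do not lengthen over time, the likelihood keeps rising with N and the scan stops at `max_total`, a known weakness of this model on data that show no reliability growth.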

The Jelinski-Moranda model has been extended by Wolverton 
and Schick (ref. 7-6). They assume that the error rate is 
proportional not only to the number of errors but also to the time 
spent in debugging, so that the chance of discovery increases as 
time goes on. Thayer, Lipow, and Nelson (ref. 7-7) give 
another extension in which more than one error can be detected 
in a time interval, with no correction being made after the end 
of this interval. New maximum likelihood estimators of $N$ and 
$\phi$ are also given. 

All the models presented so far attempt to predict the reliabil- 
ity of a software system after a period of testing and debugging. 
In a good example of an application of this type of model, 
Miyamoto (ref. 7-8) describes the development of an on-line, 
real-time system for which a requirement is that the mean time 
between software errors (MTBSE) has to be longer than 
30 days. The system will operate on a day-by-day basis, 
13 hours a day. (It will be loaded every morning and reset every 
evening.) The requirement is formulated so that the value of the 
reliability function $R(t)$ for $t$ = 13 hours has to be greater than 
$e^{-13/\mathrm{MTBSE}} = 0.9672$. Miyamoto also gives the MTBSE 
variations in time as a function of the debugging time. The 
MTBSE remained low for most of the debugging period, 
jumping to an acceptable level only at the end. The correlation 
coefficient between the remaining number of errors in the 
program and the failure rate was 0.77, but the scatter plot shown 
is disappointing and suggests that the correlation coefficient 
between the failure rate and any other system variable could 
have given the same value. In the same paper, Miyamoto 
describes in detail how the system was tested. 
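
The 0.9672 figure can be reproduced with a short calculation. The sketch assumes that the 30-day MTBSE is counted in 13-hour operating days (390 operating hours) and that the reliability function is exponential; both are inferences from the quoted numbers rather than statements taken from ref. 7-8.

```python
import math

HOURS_PER_DAY = 13  # daily operating period stated in the text
MTBSE_DAYS = 30     # required mean time between software errors

# Assumption: the requirement is expressed in operating hours.
mtbse_hours = MTBSE_DAYS * HOURS_PER_DAY

# Probability of completing one 13-hour day without a software error.
required_reliability = math.exp(-HOURS_PER_DAY / mtbse_hours)
print(f"R(13 h) must exceed {required_reliability:.4f}")
```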

None of the above models takes into account that in the 
process of fixing a bug, new errors may be introduced in the 
system. The final number given is usually the mean time 
between software errors, but only Miyamoto points out that this 
number is valid only for a specific set of workload conditions. 

Other models for studying the improvement in reliability of 
a software item during its development phase exist, such as 
Littlewood (ref. 7-9), in which the execution of a program is 
simulated with continuous-time Markov switching among 
smaller programs. This model also demonstrates that under 
certain conditions in the software system structure, the failure 
process will be asymptotically Poisson. Trivedi and Shooman 
(ref. 7-10) give another Markov model, in which the most 
probable number of errors that will have been corrected at any 
time t is based on preliminary modeling of the error occurrence 
and repair rates. The model also predicts the system’s availability 
and reliability at time $t$. Schneidewind (ref. 7-11) describes 
a model which assumes that the failure process is described by 
a nonhomogeneous Poisson process. The rate of error detection 
in a time interval is assumed to be proportional to the number 
of errors present during that interval. This leads to a Poisson 
distribution with a decreasing hazard rate. 

Data Domain Models 

Another approach to software reliability modeling is study- 
ing the data domain. The first model of this kind is described by 
Nelson (ref. 7-12). In principle, if sets of all input data upon 
which a computer program can operate are identified, the 
reliability of the program can be estimated by running the 
program for a subset of input data. Thayer, Lipow, and Nelson 
(ref. 7-7) describe data domain techniques in more detail. 
Schick and Wolverton (ref. 7-13) compare the time domain and 
data domain models. However, different applications will tend 
to use different subsets of all possible input data, yielding 
different reliability values for the same software system. This 
fact is formally taken into account by Cheung (ref. 7-14), in 
which software reliability is estimated from a Markov model 
whose transition probabilities depend on a user profile. Cheung 
and Ramamoorthy (ref. 7-15) give techniques for evaluating 
the transition probabilities for a given profile. 

In the Nelson model (ref. 7-12) a computer program is 
described as a computable function $F$ defined on the set 
$E = (E_i,\ i = 1, \ldots, N)$, where $E$ includes all possible combinations 
of input data. Each $E_i$ is a sample of data needed to make a run 
of the program. Execution of a program produces, for a 
given value of $E_i$, the function value $F(E_i)$. 

In the presence of bugs or design errors, a program actually 
implements $F'$. Let $E_e$ be the set of input data such that $F'(E_e)$ 
produces an execution failure (execution terminates prematurely, 
or fails to terminate, or the results produced are not 
acceptable). If $N_e$ is the number of $E_i$ leading to failure, 

$$p = \frac{N_e}{N}$$

is the probability that a run of the program will result in an 
execution failure. Nelson defines the reliability $R$ as the probability 
of no failures or 

$$R = 1 - p = 1 - \frac{N_e}{N}$$

In addition, this model is further refined to account for the 
fact that the inputs to a program are not selected from $E$ with 
equal a priori probability but are selected according to some 
operational requirement. This requirement may be characterized 
by a probability distribution $(P_i,\ i = 1, \ldots, N)$, $P_i$ being the 
probability that the selected input is $E_i$. If we define the auxiliary 
variables $Y_i$ to be 0 if a run with $E_i$ is successful, and 
1 otherwise, 

$$p = \sum_{i=1}^{N} P_i Y_i$$

where $p$ is again the probability that a run of the program will 
result in an execution failure. 

A mathematical definition of the reliability of a computer 
program is given as the probability of no execution failures after 
$n$ runs: 

$$R(n) = R^n = (1 - p)^n$$

The model elaborates on how to choose input data values at 
random from $E$ according to the probability distribution $P_i$ to 
obtain an unbiased estimator of $R(n)$. In addition, if the execution 
time for each $E_i$ is also known, the reliability function can 
be expressed in terms of the more conventional probability of 
no failure in a time interval $(0, t)$. 
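
The Nelson definitions can be illustrated with a small Python sketch. The operational profile, the failure indicators, and the function names are all hypothetical; the sampling routine only indicates how an estimate of the failure probability might be obtained by drawing inputs according to the profile, and is not the procedure of ref. 7-12.

```python
import random


def failure_probability(profile, fails):
    """p = sum of P_i * Y_i over all inputs (Y_i = 1 when input i fails)."""
    return sum(p_i * y_i for p_i, y_i in zip(profile, fails))


def reliability(p, runs):
    """R(n) = (1 - p)^n, the probability of no failure in n runs."""
    return (1.0 - p) ** runs


def estimate_failure_probability(profile, fails, runs, seed=0):
    """Estimate p by sampling inputs according to the operational profile."""
    rng = random.Random(seed)
    drawn = rng.choices(range(len(profile)), weights=profile, k=runs)
    return sum(fails[i] for i in drawn) / runs


# Hypothetical four-input program: only the third input class fails.
profile = [0.50, 0.30, 0.15, 0.05]  # P_i, must sum to 1
fails = [0, 0, 1, 0]                # Y_i

p = failure_probability(profile, fails)
p_hat = estimate_failure_probability(profile, fails, runs=10_000)
print(f"p = {p:.3f}, sampled estimate = {p_hat:.3f}")
print(f"R(10) = {reliability(p, 10):.4f}")
```

Changing the profile while leaving the failing inputs fixed changes p, which is why the same program can show different reliability under different usage.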

Chapter 6 in Thayer, Lipow, and Nelson (ref. 7-7) extends 
the previous models to take into account how the testing of 
input data sets should be partitioned. Also discussed are the 
uncertainty in predicting reliability values, the effect of remov- 
ing software errors, and the effect of program structure. 


Axiomatic Models 

The third category includes models in which software reli- 
ability (as well as software quality in general) is postulated to 
obey certain universal laws (Ferdinand and Sutherla, ref. 7-16; 
Fitzsimmons and Love, ref. 7-17). Although such models have 
generated great interest, their general validity has never been 
proven and, at most, they only give an estimate of the number 
of bugs present in a program. 

The best-known axiomatic model is the so-called software 
science theory developed by Halstead (see ref. 7-18). Halstead 
used an approach similar to thermodynamics to provide quantitative 
measures of program level, language level, algorithm 
purity, program clarity, effect of modularization, programming 
effort, and programming time. In particular, the estimated 
number of bugs in a program is given by the expression 


where 

K proportionality constant 

£ 0 mean number of mental discriminations between errors 
made by programmer 
TABLE 7-1.—CORRELATION OF EXPERIENCE TO SOFTWARE BUG
PREDICTION BY AXIOMATIC MODELS

Reference                              Correlation coefficient
                                       between predicted and
                                       real number of bugs

Funami and Halstead (ref. 7-19)        0.98, 0.83, 0.92
Cornell and Halstead (ref. 7-20)       0.99
Fitzsimmons and Love (ref. 7-17):
    System A                           0.81
    System B                            .75
    System C                            .75
    Overall                             .76

$V$      volume of algorithm implementation, $N \log_2 n$

where

$N$      program length
$n$      size of vocabulary defined by language used

More specifically,

$$N = N_1 + N_2$$

$$n = n_1 + n_2$$

where

$N_1$    total number of occurrences of operators in a program
$N_2$    total number of occurrences of operands in a program
$n_1$    number of distinct operators appearing in a program
$n_2$    number of distinct operands appearing in a program

and $E_0$ has been empirically estimated to be approximately
3000.
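
As a worked illustration of these definitions, the following Python sketch computes the volume V = N log2(n) from operator and operand counts. The bug estimate in it assumes the commonly quoted Halstead form B = K V / E_0 with K = 1; that form, the function names, and the counts in the example are assumptions made for illustration.

```python
import math

def halstead_volume(N1, N2, n1, n2):
    """V = N * log2(n), with N = N1 + N2 and n = n1 + n2."""
    N = N1 + N2          # program length
    n = n1 + n2          # vocabulary size
    return N * math.log2(n)

def estimated_bugs(volume, E0=3000.0, K=1.0):
    """Assumed form B = K * V / E0; E0 is about 3000 per the text."""
    return K * volume / E0

# Hypothetical counts for a small program.
V = halstead_volume(N1=1200, N2=800, n1=30, n2=98)
print("volume:", V)
print("estimated bugs:", estimated_bugs(V))
```

The counts are cheap to collect from a tokenized source file, which is why the text calls these measures easy to compute.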

Many publications have either supported or contradicted the 
results proposed by the software science theory, including a 
special issue of the IEEE Transactions on Software Engineering
(ref. 7-18). Though unconventional, the measures proposed
by the software science theory are easy to compute, and
in any case they offer an alternative for estimating the number
of bugs in a software system. Table 7-1 shows the correlation
coefficients between the real number of bugs found in a software
project and the number predicted by the software science theory
for several experiments. There are significant correlations with
error occurrences in the programs, although the data reported by
Fitzsimmons and Love (ref. 7-17) (obtained from three Gen- 
eral Electric software development projects totaling 166 280 
statements) show weaker correlation than the original values 
reported by Halstead. 


Other Models 

The model presented by Costis, Landrault, and Laprie 
(ref. 7-21) is based on the fact that for well-debugged pro- 
grams, a software error results from conditions on both the 
input data set and the logical paths encountered. We can then 
consider these events random and independent of the past 
behavior of the system (i.e., with constant failure rate). Also,
because of their rarity, design errors or bugs may have the same 
effect as transient hardware faults. 

The model is built on the following assumptions: 

(1) The system initially possesses N design errors or bugs 
that can be totally corrected by N interventions of the main- 
tenance team. 

(2) The software failure rate is constant for a given number 
of system design errors. 

(3) The system starts and continues operation until a fault is 
detected; it then passes to a repair state. If the fault is due to a 
hardware transient, the system is put into operation again after 
a period of time for which the probability density function is 
assumed to be known. If the fault is due to a software failure, 
maintenance takes place, during which the error may be 
removed, more errors may be introduced, or no modifications 
may be made to the software. 

The model computes the availability of the system as a 
function of time by using semi-Markovian theory. That is, the 
system will make state transitions according to the transition 
probabilities matrix, and the time spent in each state is a random 
variable whose probability density function is either assumed 
to be known or is measurable. The main result presented by
Costis, Landrault, and Laprie (ref. 7-21) is how the availability
of the system improves toward its asymptotic value (reached when
all the design errors have been removed) as the design errors are
being removed under some
restrictive conditions. They show that the minimum availabil- 
ity depends only on the software failure rate at system integra- 
tion and not on the order of occurrence of the different types of 
design errors. The presence of different types of design errors 
only extends the time necessary to approach the asymptotic 
availability. 
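
The flavor of such an availability calculation can be sketched with a small Monte Carlo simulation in Python. This is not the semi-Markov solution of reference 7-21: it assumes exponential holding times, a software failure rate proportional to the number of remaining design errors, and made-up parameter values, all of which are illustrative assumptions.

```python
import random

def simulate_availability(n_errors, horizon, *, rate_per_error=0.01,
                          hw_transient_rate=0.02, mean_restart=0.1,
                          mean_maintenance=2.0, p_remove=0.8, p_add=0.05,
                          seed=0):
    """Return the fraction of `horizon` hours spent in the operating state."""
    rng = random.Random(seed)
    t = up = 0.0
    k = n_errors                      # design errors still in the software
    while t < horizon:
        sw_rate = k * rate_per_error  # constant for a given number of errors
        total = sw_rate + hw_transient_rate
        if total == 0.0:              # nothing left that can fail
            up += horizon - t
            break
        run = min(rng.expovariate(total), horizon - t)
        up += run
        t += run
        if t >= horizon:
            break
        if rng.random() < sw_rate / total:
            # Software failure: maintenance may remove an error,
            # introduce a new one, or leave the software unchanged.
            t += rng.expovariate(1.0 / mean_maintenance)
            u = rng.random()
            if u < p_remove:
                k -= 1
            elif u < p_remove + p_add:
                k += 1
        else:
            # Hardware transient: restart after a short delay.
            t += rng.expovariate(1.0 / mean_restart)
    return up / horizon

for hours in (100.0, 1000.0, 10000.0):
    print(hours, round(simulate_availability(10, hours), 4))
```

Running the simulation over longer horizons shows the measured availability rising as errors are removed, which is the qualitative behavior the model describes.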

The mathematics of the model is complex, requiring numerical
computation of inverse Laplace transforms for the transition
probabilities matrix, and it is not clear that the parameters 
needed to simulate a real system accurately can be easily 
measured from a real system. 

Finally, some attempts have been made to model fault-tolerant
software through module duplication (Hecht, ref. 7-22), and
warnings have been issued about how not to measure software
reliability (Littlewood, ref. 7-23).

None of the preceding models characterizes system behavior 
accurately enough to give the user a guaranteed level of perfor- 
mance under general workload conditions. They estimate the 
number of bugs present in a program but do not provide any 
accurate method of characterizing and measuring operational
system unreliability due to software. There is a large gap 
between the variables that can be easily measured in a running 
system and the number of bugs in its software. Instead, a cost- 
effective analysis should allow precise evaluation of software 
unreliability from variables easily measurable in an operational 
system, without knowing the details of how the software has 
been written. 

Trends and Conclusions 

With software reliability being questioned as a science, 
programming process control appears to be the popular answer 
to both software reliability and software quality. Measurements 
of the programming process are supposed to ensure the genera- 
tion of an “error-free” programming product, if such an achieve- 
ment is possible. Further, quality and productivity measurements 
combined with select leading process indicators are supposed 
to fulfill the control requirements for developing quality soft- 
ware. This so-called answer is similar to a philosophy that failed 
in attempts to develop hardware reliability control. Reliability 
should be used to predict field performance. Especially with 
real-time communications and information management sys- 
tems, the field performance requirements vastly overshadow the 
field defect level requirements. How can we change the present 
popular trend (toward programming process control) to one 
that includes a probabilistic reliability approach? The answer is 
not a simple one; these models must be finely balanced so that 
a clear separation of reliability and quality can be achieved. 

The trends for reliability tasks in the large-scale integrated 
circuit (LSI) and very large-scale integrated circuit (VLSI) 
hardware areas are in the failure modes and effects analysis and 
the control of failures. The same emphasis can be placed on 
software (programming bugs or software errors). Once this is 
done, reliability models can reflect system performance due to 
hardware and software “defects” because their frequency of 
occurrence and the effects of their presence in the operation will 
be known. This philosophy focuses on the complete elimina- 
tion of critical defects and the specified tolerance level of minor 
defects. Normally, minor defects are easier to find and more 
numerous than the most critical defects and therefore dominate 
a defect-removal-oriented model. 

We conclude that the proper method for developing quality 
programming products combines quality, reliability, and a 
selective measurements program. In addition, a redirection of 
the programming development process to be based in the future 
on the criticality of defects, their number, and their budgeting 
at the various programming life-cycle phases is the dominant 
requirement. A reliability growth model will monitor and con- 
trol the progress of defect removal for the design phases and 
prove a direct correlation to actual system field performance. 
With such an approach, a system can be placed in operation at 
a customer site at a preselected performance level as predicted 
by the growth model. 


Software 

For several reasons, we have discussed software models before 
describing software. The reader should not be biased or led to 
a specific type of software. Few papers on software reliability 
make a distinction between product software, embedded soft- 
ware, applications software, and support software. In addition, 
the models do not distinguish between vendor-acquired soft- 
ware and in-house software and combinations of these. 


Categories of Software 

According to Electronic Design Magazine, the United States 
supports at least 50 000 software houses, each grossing 
approximately $500 000 per year. It is projected that software 
sales in the United States will surpass hardware sales and reach 
the $60 billion range. International competition will eventually 
yield error-free software.

In-house and vendor-acquired software can be categorized as
follows: 

(1) Product 

(2) Embedded 

(3) Applications 

(4) Support 

Product software.— This categorization is from the viewpoint
of the software specialist. Communications digital switching
systems software is included as “product software” along
with the software for data packet switching systems, text
systems, etc.

Embedded software.— This category comprises program- 
ming systems embedded in physical products to control their 
operational characteristics. Examples of products are radar 
controllers, boiler controls, avionics, and voice recognition 
systems. 

Applications software.— This category is usually developed
to service a company’s internal operations. The accounting area 
of this category covers payroll systems, personnel systems, etc. 
The business area includes reservations systems (car, motel), 
delivery route control, manufacturing systems, and on-line 
agent systems. 

Support software. — This category consists of the software 
tools needed to develop, test, and qualify other software prod- 
ucts or to aid in engineering design and development. The 
category includes compilers, assemblers, test executives, error 
seeders, and development support systems. 

Vendor-acquired software.— This software can be absorbed
by the previous four categories and is only presented here for 
clarification. It includes FORTRAN compilers, COBOL com- 
pilers, assemblers, the UNIX operating system, the ORACLE 
data base system, and application packages. 
Processing Environments

Software can usually be developed in three ways: (1) inter- 
active, (2) batch, and (3) remote job entry. In the operational 
environment, these expand to include real time. Real-time 
development can be characteristic of both product software and 
embedded software. However, because product software and 
embedded software differ greatly in their requirements and in 
their development productivity and quality methodologies, 
they should not be combined (e.g., avionics has size, weight, 
and reliability requirements resulting in dense software of a 
type that a communications switching system does not have). 

Severity of Software Defects 

We must categorize and weigh the effects of failures. The 
following four-level defect severity classification is presented 
in terms of typical software product areas: 

(1) System unusable (generic: frequent system crashes) 

(a) Management information system (MIS) software 
defects: inability to generate accounts payable or to 
access data base; improper billing 

(b) Computer-aided design (CAD), manufacturing (CAM), 
and engineering (CAE) defects: inability to use systems; 
CAD produces incorrect designs 

(c) Telephone switching defects: frequent service outages; 
loss of emergency communications service 

(d) Data communications defects: loss of one or 
more signaling channels; unrecoverable errors in trans- 
mission; erratic service 

(e) Military system defects: success of mission jeopardized; 
inability to exercise fire control systems; loss of elec- 
tronic countermeasure capabilities 

(f) Space system defects: success of space mission jeopardized;
risk to the lives of the ground support team or flight crew; loss
of critical telemetry information

(g) Process control defects: waste of labor hours, raw 
materials, or manufactured items; loss of control result- 
ing in contamination or severe air and water pollution 

(2) Major restrictions (generic: loss of some functions) 

(a) MIS software defects: loss of some ticket reservation 
centers or loss of certain features such as credit card 
verification 

(b) CAD/CAM/CAE defects: loss of some features in 
computer-aided design such as the update function; 
significant operational restrictions in CAM or CAE areas;
faults produced for which there is no workaround 

(c) Telephone switching defects: loss of full traffic cap- 
ability; loss of billing 

(d) Data communications defects: occasional loss of con- 
sumer data; inability to operate in degraded mode with 
loss of equipment 


(e) Military system defects: significant operational restrictions;
loss of intermediate fast frequency function in
detection systems; loss of one or more antijamming 
features 

(f) Space system defects: occasional loss of telemetry data 
and communications; significant operational or control 
restrictions 

(g) Process control defects: process cannot consistently 
handle exceptions; inability to complete all process 
control functions 

(3) Minor restrictions (generic: loss of features; inability to 
effectively modify program) 

(a) MIS software defects: mishandling of records; system 
occasionally cannot handle exceptions 

(b) CAD/CAM/CAE defects: occasional errors produced 
in design system; faults produced for which there are 
workarounds 

(c) Telephone switching defects: loss of some support 
feature, such as call forwarding or conferencing 

(d) Data communications defects: occasional inability to 
keep up with data rate or requests; occasional minor loss 
of data transmitted or received 

(e) Military system defects: loss of some operational modes
such as tracking history, monitor or slave mode of operation,
multiple option selection

(f) Space system defects: occasional loss of update information
or frame; occasional loss of subframe synchronization
or dropouts of some noncritical measurements

(g) Process control defects: problems that require a work- 
around to be implemented; minor reductions in rate or 
throughput; manual intervention at some points in the 
process 

(4) No restrictions (generic: cosmetic; misleading documenta- 
tion; inefficient machine/person interface) 
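
One way to put the four-level classification to work is to encode it as an enumeration and tally logged defects by level. The Python sketch below does this; the level names and generic descriptions follow the list above, while the `summarize` helper and the sample defect log are hypothetical.

```python
from collections import Counter
from enum import IntEnum

class Severity(IntEnum):
    """Four-level defect severity classification (1 is most severe)."""
    SYSTEM_UNUSABLE = 1
    MAJOR_RESTRICTIONS = 2
    MINOR_RESTRICTIONS = 3
    NO_RESTRICTIONS = 4

GENERIC_DESCRIPTION = {
    Severity.SYSTEM_UNUSABLE: "frequent system crashes",
    Severity.MAJOR_RESTRICTIONS: "loss of some functions",
    Severity.MINOR_RESTRICTIONS:
        "loss of features; inability to effectively modify program",
    Severity.NO_RESTRICTIONS:
        "cosmetic; misleading documentation; "
        "inefficient machine/person interface",
}

def summarize(defects):
    """Count defects per severity level, most severe first."""
    counts = Counter(severity for _, severity in defects)
    return [(level, counts.get(level, 0)) for level in Severity]

# Hypothetical defect log for a telephone switching product.
defect_log = [
    ("loss of billing", Severity.MAJOR_RESTRICTIONS),
    ("call forwarding unavailable", Severity.MINOR_RESTRICTIONS),
    ("frequent service outages", Severity.SYSTEM_UNUSABLE),
    ("conferencing unavailable", Severity.MINOR_RESTRICTIONS),
]

for level, count in summarize(defect_log):
    print(f"({level.value}) {level.name}: {count} "
          f"[{GENERIC_DESCRIPTION[level]}]")
```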


Software Bugs Compared With Software Defects 

Software bugs are not necessarily software defects: the term
“defect” implies that removal or repair is necessary, and the term
“bug” implies removal, some degree of correction, or a certain
level of toleration. A recent example of bug toleration from the 
telecommunications industry is contained in reference 7-24: 

It is not technically or economically feasible to 
detect and fix all software problems in a system 
as large as No. 4 Electronic Switching System 
(ESS). Consequently, a strong emphasis has been 
placed on making it sufficiently tolerant of soft- 
ware errors to provide successful operation and 
fault recovery in an environment containing soft- 
ware problems. 
Various opinions exist in the industry about what constitutes
a software failure. Definitions range from a software failure 
being classed as any software-caused processor restart or 
memory reload to a complete outage. One argument against 
assigning an MTBF to software-caused processor restarts or 
memory reloads is that if the system recovers in the proper 
manner by itself, there has not been a software failure, only a 
software fault or the manifestation of a software bug. From a 
systems reliability viewpoint, if the system recovers within a 
reasonable time, the event is not to be classed as a software 
failure. 


Hardware and Software Failures 

Microprocessor-based products have more refined defini- 
tions. Four types of failure may be considered: (1) hardware 
catastrophic, (2) hardware transient, (3) software catastrophic, 
and (4) software transient. In general, the catastrophic failures 
require a physical or remote hardware replacement, a manual or 
remote unit restart, or a software program patch. The transient 
failure categories can result in either restarts or reloads for the 
microprocessor-based systems, subsystems, or individual units 
and may or may not require further correction. A recent reli- 
ability analysis of such a system assigned ratios to these 
categories. Hardware transient faults were assumed to occur at 
10 times the hardware catastrophic rate, and software transient 
faults were assumed to occur at 100 to 500 times the software 
catastrophic rate. 
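
The assumed ratios can be applied mechanically once the catastrophic rates are known. The Python sketch below does so; the factor of 10 and the 100-to-500 range come from the text, whereas the function name and the catastrophic rates used in the example are hypothetical.

```python
def assumed_failure_rates(hw_catastrophic, sw_catastrophic,
                          sw_transient_ratio=100):
    """Apply the quoted ratios to catastrophic failure rates.

    Rates may be in any consistent unit (e.g., failures per 10**6 hours).
    """
    if not 100 <= sw_transient_ratio <= 500:
        raise ValueError("the text quotes a software ratio of 100 to 500")
    return {
        "hardware catastrophic": hw_catastrophic,
        "hardware transient": 10 * hw_catastrophic,
        "software catastrophic": sw_catastrophic,
        "software transient": sw_transient_ratio * sw_catastrophic,
    }

# Hypothetical catastrophic rates, in failures per million hours.
low = assumed_failure_rates(5.0, 2.0, sw_transient_ratio=100)
high = assumed_failure_rates(5.0, 2.0, sw_transient_ratio=500)
for name in low:
    print(f"{name}: {low[name]} to {high[name]} per 10^6 h")
```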

The time of day is of great concern in reliability modeling and 
analysis. Although hardware catastrophic failures occur at any 
time of the day, they often manifest themselves during busier 
system processing times. On the other hand, hardware transient 
failures generally occur during the busy hours as do software 
transient failures. The availability of restart times is also critical,
and in the example presented in reference 7-25, the system
downtime is presented as a function of the MTBF of the software
and the reboot time. When a system’s predicted reliability is
close to the specified reliability, such a sensitivity analysis must 
be performed. 
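
A sensitivity analysis of this kind can be sketched in a few lines of Python. The sketch assumes that every software failure costs exactly one reboot and uses the steady-state unavailability reboot / (MTBF + reboot); it is not the calculation of reference 7-25, and the MTBF and reboot values swept are hypothetical.

```python
HOURS_PER_YEAR = 8760.0

def downtime_minutes_per_year(software_mtbf_hours, reboot_minutes):
    """Expected yearly downtime when each software failure costs one reboot.

    MTBF is treated here as the mean operating time between failures.
    """
    reboot_hours = reboot_minutes / 60.0
    unavailability = reboot_hours / (software_mtbf_hours + reboot_hours)
    return unavailability * HOURS_PER_YEAR * 60.0

# Sweep both parameters to see which one the downtime is more sensitive to.
for mtbf in (500.0, 1000.0, 2000.0):
    row = [round(downtime_minutes_per_year(mtbf, r), 1)
           for r in (1.0, 5.0, 10.0)]
    print(f"MTBF {mtbf:6.0f} h -> min/yr for 1, 5, 10 min reboots: {row}")
```

When the predicted figure sits close to the specified limit, a table like this shows quickly whether improving the MTBF or shortening the reboot buys more margin.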

Reference 7-26 presents a comprehensive summary of
developed models and methods that encompass software life- 
cycle costs, productivity, reliability and error analysis, com- 
plexity, and the data parameters associated with these models 
and methods. The various models and methods are compared 
in reference 7-26 on a common basis, and the results are 
presented in matrix form. 


Manifestations of Software Bugs 

Many theories, models, and methods are available for quantifying
software reliability. Nathan (ref. 7-27) stated, “It is
contrary to the definition of reliability to apply reliability
analysis to a system that never really works. This means that the
software which still has bugs in it really has never worked in the
true sense of reliability in the hardware sense.” This
statement agrees with reference 7-24, which says that large, 
complex software programs used in the communications indus- 
try are usually operating with some software bugs. Thus, a 
reliability analysis of such software is different from a reliabil- 
ity analysis of established hardware. Software reliability is not 
alone in the need for establishing qualitative and quantitative 
models. Reference 7-28 discusses the “bathtub curve” and the 
effect of recent data on electronic equipment failure rate, and 
reference 7-30 discusses the effects of deferred maintenance 
and nonconstant software and hardware fault rates. 

In the early 1980’s work was done on a combined hardware/
software reliability model. Reference 7-30 states, “The use of
steady-state availability as a reliability/maintainability meas- 
ure is shown to be misleading for systems exhibiting both 
hardware and software faults.” The authors develop a theory 
for combining well-known hardware and software models in a 
Markov process and consider the topic of software bugs and 
errors based on their experience in the telecommunications 
field. To synthesize the manifestations of software bugs, we 
must note some of the hardware trends for these systems. 

(1) Hardware transient failures increase as integrated circuits
become denser.

(2) Hardware transient failures tend to remain constant or 
increase slightly with time after the infant mortality phase. 

(3) Hardware (integrated circuit) catastrophic failures 
decrease with time after the infant mortality phase. 

These trends affect the operational software of communica- 
tions systems. If the transient failures increase, the error analy- 
sis and system security software are called into action more 
often. This increases the risk of misprocessing a given transac- 
tion in the communications system. A decrease in the cata- 
strophic failure rate of integrated circuits can be significant, as 
described in reference 7-13, which predicts an order-of- 
magnitude decrease in the failure rate of 4K memory devices 
between the first year and the twentieth year. We also tend to 
over-simplify the actual situations. Even with five vendors of 
these 4K devices, the manufacturing quality control person 
may have to set up different screens to eliminate the defective 
devices from different vendors. Thus, the system software will 
see many different transient memory problems and combina- 
tions of them in operation. 

Central control technology has prevailed in communications
systems for 25 years. The industry has used many of its old
modeling tools and applied them directly to distributed control
structures. Most modeling research was performed on large
duplex processors. With an evolution through forms of multiple
duplex processors and load-sharing processors and onto
the present forms of distributed processing architectures, the
modeling tools need to be verified. With fully distributed control
systems the software reliability model must be conceptually
matched to the software design to achieve valid predictions
of reliability.

TABLE 7-2.—CRITICALITY INDEX

Bug                Defect         Level of      Failure type                Failure
manifestation      removal        criticality                               characteristic
rate               rate

4 per day          3 per month    5             Transient                   Errors come and go
3 per day          1 per week     4             Transient                   Errors are repeated
2 per week         1 per month    3             Transient or catastrophic   Service is affected
1 per month        2 per year     2             Transient or catastrophic   System is partially down
1 per two years    1 per year     1             Catastrophic                System stops

NASA/TP—2000-207428

The following trends can be formulated for software tran- 
sient failures: 

(1) Software transient failures decrease as the system architecture
approaches a fully distributed control structure.

(2) Software transient failures increase as the processing 
window decreases (i.e., less time allowed per function, fast 
timing mode entry, removal of error checking, removal of 
system ready checks, etc.).

A fully distributed control structure can be configured to 
operate as its own error filter. In a hierarchy of processing levels, 
each level acts as a barrier to the level below and prevents errors 
or transient faults from propagating through the system. Cen- 
tral control structures cannot usually prevent this type of error 
propagation. 

If the interleaving of transaction processes in a software 
program is reduced, such as with a fully distributed control 
architecture, the transaction processes are less likely to fail. This 
is especially true with nonconsistent user interaction as experi- 
enced in communications systems. Another opinion on soft- 
ware transient failures is that the faster a software program runs, 
the more likely it is to cause errors (such as encountered in 
central control architectures). Some general statements can be 
formulated: 

(1) In large communications systems, software transient 
failures tend to remain constant, and software catastrophic 
failures tend to decrease with time. 

(2) In small communications systems, software transient 
failures decrease with time. 

(3) As the size of the software program increases, software 
transient failures decrease and hardware failures increase. 

A “missing link” needs further discussion. Several methods 
can be used to quantify the occurrence of software bugs. 
However, manifestations in the system’s operations are detri- 
mental to the reliability analysis because each manifestation 
could cause a failure event. The key is to categorize levels of
criticality for bug manifestations and estimate their probability 
of occurrence and their respective distributions. The impor- 
tance of this increases with the distribution of the hardware and 
software. Software reliability is often controlled by establish- 
ing a software reliability design process. Reference 7-24 pre- 
sents techniques for such a design process control. The final 
measure is the system test, which includes the evaluation of 
priority problems and the performance of the system while
under stress as defined by audits, interrupts, reinitialization, and
other measurable parameters. The missing link in quantifying
software bug manifestations needs to be found before we can 
obtain an accurate software reliability model for measuring 
tradeoffs in the design process on a predicted performance 
basis. If a software reliability modeling tool could additionally 
combine the effects of hardware, software, and operator faults, 
it would be a powerful tool for making design tradeoff deci- 
sions. Table 7-2 is an example of the missing link and presents 
a five-level criticality index for defects. Previously, we dis- 
cussed a four-level defect severity classification with level four 
not causing errors. These examples indicate the flexibility of 
such an approach to criticality classification. 
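
To compare the rates in Table 7-2 on a common basis, they can be converted to events per year. The Python sketch below encodes the table rows as given; the conversion factors (365 days, 52 weeks, and 12 months per year) and the helper name are assumptions made for illustration.

```python
PER_YEAR = {"day": 365.0, "week": 52.0, "month": 12.0,
            "year": 1.0, "two years": 0.5}

# (criticality level, manifestation count, period,
#  removal count, period, failure type), as listed in Table 7-2
TABLE_7_2 = [
    (5, 4, "day", 3, "month", "transient"),
    (4, 3, "day", 1, "week", "transient"),
    (3, 2, "week", 1, "month", "transient or catastrophic"),
    (2, 1, "month", 2, "year", "transient or catastrophic"),
    (1, 1, "two years", 1, "year", "catastrophic"),
]

def yearly_rates(table=TABLE_7_2):
    """Map each level to (manifestations per year, removals per year)."""
    rates = {}
    for level, m_count, m_period, r_count, r_period, _ in table:
        rates[level] = (m_count * PER_YEAR[m_period],
                        r_count * PER_YEAR[r_period])
    return rates

for level, (manifest, removal) in sorted(yearly_rates().items(), reverse=True):
    relation = "exceeds" if manifest > removal else "does not exceed"
    print(f"level {level}: {manifest:g} manifestations/yr {relation} "
          f"{removal:g} removals/yr")
```

On this basis the manifestation rate exceeds the removal rate at levels 5 through 2, and only at level 1 is the removal rate the larger of the two.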

Software reliability measurement and its applications are 
discussed in reference 7-31 for two of the leading software 
reliability models, Musa’s execution time model and 
Littlewood’s Bayesian model. Software reliability measure- 
ment has made substantial progress and continues to progress 
as additional projects collect data. Work is under way on the
major hurdle: establishing a software reliability measurement
tool for use during the requirements stage.

Comparing references 7-32 and 7-31 yields an insight into
the different methods of achieving software reliability. The 
method described in reference 7-32 concentrates on the design 
process meeting a preset level of reliability or performance at
the various project design stages. When the system meets its 
final software reliability acceptance criteria, the process is 
complete. Reference 7-31 describes a model that provides 
the design process with a continuous software reliability 
growth prediction. The Musa model can compare simulta- 
neous software developments and can be used extensively in 
making design process decisions. An excellent text on software 
reliability based on extensive data gathering was published in
1987 (ref. 7-33). 
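
To make the idea of a continuous reliability growth prediction concrete, the Python sketch below uses the commonly published basic form of Musa's execution time model, in which failure intensity decays exponentially with execution time. The text does not give these equations, so the formulas, the symbols lambda0 (initial failure intensity) and nu0 (total expected failures), and the numeric values are assumptions for illustration.

```python
import math

def failures_experienced(tau, lambda0, nu0):
    """Expected failures after execution time tau:
    mu(tau) = nu0 * (1 - exp(-lambda0 * tau / nu0))."""
    return nu0 * (1.0 - math.exp(-lambda0 * tau / nu0))

def failure_intensity(tau, lambda0, nu0):
    """Failure intensity after execution time tau:
    lambda(tau) = lambda0 * exp(-lambda0 * tau / nu0)."""
    return lambda0 * math.exp(-lambda0 * tau / nu0)

def execution_time_to_reach(target_intensity, lambda0, nu0):
    """Execution time needed, starting from tau = 0, to reach the target."""
    return (nu0 / lambda0) * math.log(lambda0 / target_intensity)

# Hypothetical project: 10 failures per CPU hour at the start of test,
# 100 failures expected in total.
lambda0, nu0 = 10.0, 100.0
tau = execution_time_to_reach(1.0, lambda0, nu0)
print("CPU hours to reach 1 failure/h:", round(tau, 2))
print("failures expected by then:",
      round(failures_experienced(tau, lambda0, nu0), 1))
```

Tracking the observed failure intensity against such a curve is one way a project can judge whether the system is ready for release at a preselected performance level.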

We can choose a decreasing, constant, or increasing software 
bug removal rate for systems software. Although each has its 
application to special situations and systems, a decreasing 
software bug removal rate will generally be encountered. 
Systems software also has advantages in that certain software 
defects can be temporarily patched and the permanent patch 
postponed to a more appropriate date. Thus, this type of defect 
manifestation is treated in general as one that does not affect 
service, but it should be included in the overall software quality 
assessment. The missing link concerns software bug manifes- 
tations. As described in reference 7-34, until the traditional 
separation of hardware and software systems is overcome in 
the design of large systems, it will be impossible to achieve a 
satisfactory performance benchmark. This indicates that soft- 
ware performance modeling has not yet focused on the specific 
causes of software unreliability. 
Reliability Training 1

1. In-house and vendor-acquired software can be classified into what four categories?

A. Product, embedded, applications, and error-free software
B. Useful, embedded, applications, and harmful software
C. Product, embedded, applications, and support software

2. Name the four categories of software reliability models.

A. Time domain, data axiom, corollary, and many
B. Time domain, data domain, axiomatic, and other
C. Time axiom, data domain, frequency domain, and corollary

3. Can the bug manifestation rate be

A. Equal to the defect removal rate?
B. Greater than the defect removal rate?
C. Less than the defect removal rate?
D. All of the above?

4. What are the various software processing environments?

A. Interactive, batch, remote job entry, and real time
B. Hyperactive, batch, close job entry, and compressed time
C. Interactive, batch, real job entry, and remote time

5. Name the four levels of severity for software defect categorizations.

A. Generic system, functional, category restrictions, and working
B. System unusable, major restrictions, minor restrictions, and no restrictions
C. System unusable, system crashes, loss of features, and minor bugs

6. A real-time system has a mean time between software errors of 15 days. The system operates 8 hours per day.
What is the value of the reliability function? Use the Miyamoto model.

A. 0.962
B. 0.999
C. 0.978

7. Is it always necessary to remove every bug from certain software products?

A. Yes
B. No
C. Don’t know

8. Name the four types of hardware and software failure.

A. Hardware part, hardware board, software module, software plan
B. Hardware plan, hardware build, software cycle, software type cycle
C. Hardware catastrophic, hardware transient, software catastrophic, software transient

1 Answers are given at the end of this manual.
Reference Document for Inspection: “Big Bird’s” House Concept

What is desired: bird house

For whom (client, customer, user): “Big Bird” (the tall yellow bird on Sesame Street)

Why: Why not, even Oscar the Grouch has a house.


“Big Bird’s” General Concept 

“Big Bird” needs a house (and he’s willing to pay for it) and he wants it big enough for him to live in (he’s over 6 feet tall). He 
wants to be able to enter and leave the house comfortably, to be able to lock out the big bad wolves (even those dressed as granny), 
the materials used to be strong enough to support his weight (he’s not particularly svelte), and to be weather proofed enough to keep 
him dry and warm in stormy weather, as defined by the post office (rain, sleet, hail, snow, wind). 


Class Meeting Exercise: Requirements Inspection 

Statement of Problem: “Big Bird” has no house.

                                                                Life Cycle Stage

Done               Step 1: Build a house.                       Concept
                   Step 2: State the kind of house desired.     Requirements
Done                                                                System
To be inspected                                                     Subsystem
                   Step 3: Make drawings of desired house.      Design
                   Step 4: Build house.                         Development
                   Step 5: Walk through house                   Test
                           (open doors and windows).
                   Step 6: Pay for house.                       Delivery
                   Step 7: Live in house.                       Operation and maintenance


Note: At any step, perform analysis and SQUAWK if changes are needed. 
Reference Document for Inspection: System Requirements
“Big Bird’s” House Systems Requirements 

Excuse Me, Are Those Requirements? 

Well, yes, after a bit of questioning and head scratching, the following system requirements were defined: 

1. The house shall accommodate “Big Bird” and his belongings. 

2. The house shall provide easy access to “Big Bird.” 

3. The building materials shall be strong enough to support “Big Bird” (who is, ahem, rather rotund). 

4. The building materials shall deny entrance to big bad wolves (straw definitely being out of favor). 

5. The house shall have security measures to prevent easy access to any nefarious beings intending “fowl” play. 

6. The building materials shall be weather proof and found in nature. 

7. The building materials shall be low cost (even birds have budgets). 

8. The house shall be one room. 

9. The house shall have one door. 

10. The house shall have a floor. 

11. The house shall have a roof.

12. The house shall have one window. 

13. The house shall rest on level ground beneath his tree. 

14. There will be no electricity, plumbing, heating, or air conditioning (client has feathers, candles, a bird bath, and ice cream). 

15. Client will bring his own bed (BHOB). 

16. The cost of the house shall not exceed 80 bird bucks. 
“Big Bird’s” Requirements Checklist


Clarity 

1. Are requirements specified in an implementation-free way so as not to obscure the original requirements?
2. Are implementation, method, and technique requirements kept separate from functional requirements?
3. Are the requirements clear and unambiguous (i.e., are there aspects of the requirements that you do not
understand; can they be misinterpreted)?


Completeness 


1. Are requirements stated as completely as possible? Have all incomplete requirements been captured as TBD's?

2. Has a feasibility analysis been performed and documented?

3. Is the impact of not achieving the requirements documented?

4. Have trade studies been performed and documented?

5. Have the security issues of hardware, software, operations personnel, and procedures been addressed?

6. Has the impact of the project on users, other systems, and the environment been assessed?

7. Are the required functions, external interfaces, and performance specifications prioritized by need date? Are they prioritized
by their significance to the system?


Compliance 

1. Does this document follow the project’s system documentation standards? 

2. Does it follow JPL’s standards? 

3. Does the appropriate standard prevail in the event of inconsistencies?


Consistency 


1. Are the requirements stated consistently without contradicting themselves or the requirements of related systems?

2. Is the terminology consistent with the user and/or sponsor’s terminology?


Correctness

1. Are the goals of the system defined?


Data Usage

1. Are “don’t care” condition values truly “don’t care?” (“Don’t care” values identify cases when the value of a condition or flag
is irrelevant even though the value may be important for other cases.)

2. Are “don’t care” condition values explicitly stated? (Correct identification of “don’t care” values may improve a design’s
portability.)


Functionality

1. Are all functions clearly and unambiguously described?

2. Are all described functions necessary and together sufficient to meet mission and system objectives?
Interfaces


1. Are all external interfaces clearly defined?

2. Are all internal interfaces clearly defined?

3. Are all interfaces necessary, together sufficient, and consistent with each other?


Maintainability

1. Have the requirements for system maintainability been specified in a measurable, verifiable manner?

2. Are requirements written to be as weakly coupled as possible so that rippling effects from changes are minimized?


Performance 

1. Are the performance specifications and the amount of performance degradation that can be tolerated explicitly stated
(e.g., consider timing, throughput, memory size, accuracy, and precision)?

2. For each performance requirement defined, 

a. Do rough estimates indicate that they can be met? 

b. Is the impact of failure to meet the requirement defined? 


Reliability 

1. Are clearly defined, measurable, and verifiable reliability requirements specified?

2. Are there error detection, reporting, and recovery requirements?

3. Are undesired events (e.g., single-event upset, data loss or scrambling, operator error) considered and their required responses
specified?

4. Have assumptions about the intended sequence of functions been stated? Are these sequences required?

5. Do these requirements adequately address the survivability after a software or hardware fault of the system from the point of
view of hardware, software, operations personnel, and procedures?


Testability 

1. Can the system be tested, demonstrated, inspected, or analyzed to show that it satisfies requirements? 

2. Are requirements stated precisely to facilitate specification of system test success criteria and requirements?


Traceability 

1. Are all functions, structures, and constraints traced to mission/system objectives?

2. Is each requirement stated in a manner that it can be uniquely referenced in subordinate documents? 


NASA/TP— 2000-207428 




“Big Bird’s” Formal Inspection Subsystem Requirements
“Subsystem Requirements” Written for Big Bird’s Approval

1. The house shall be made of wood.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

2. The house shall be nailed together.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

3. The house size shall be 4 cubits by 4 cubits by 3 cubits. (If cubits were good enough for Noah, they are good enough for us.)
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

4. The door shall be made of balsa wood.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

5. The door opening shall be 4 inches by 8 feet.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

6. The door shall have a lock and key.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

7. The door shall have a door knob and hinges.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

8. The door shall be on the same wall as the window.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

9. The door shall be 12 meters from the mah jong set, shall be glued with silly putty to the wall, and shall play the Hallelujah
   Chorus when the doorbell is rung by the wolves wanting to eat Big Bird.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

10. The floor shall be carpeted with a silk comforter (cost of 100 bird bucks; client has cold feet).
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

11. The roof shall be shingled.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

12. The shingles shall be taffy.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

13. The house shall be painted blue.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

14. The window shall be 3 by 3 feet.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

15. The window shall have interior locking wood shutters (wolf proofing).
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

16. The window shall have a screen.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

17. The screen shall be made of oriental tissue paper.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)
“Big Bird’s” Formal Inspection Subsystem Requirements
“Subsystem Requirements” Written for Big Bird’s Approval

(Concluded)

18. [requirement text illegible in source]
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

19. The window shall have double thermal glass panes.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

20. The window shall be placed next to the flumajubit and the whupinsnapper.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

21. The house shall be insulated (see 10 above).
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

22. The insulation shall be cough lozenges (Smith Brothers, cherry flavor).
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

23. The house shall have one bed (cost of 100 bird bucks).
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)

24. The cost of the house shall be 300 bird bucks.
   Acceptable [ ]  or  Major / Minor / Open issue;  Missing / Wrong / Extra;  Type [ ]  Origin [ ]  (defect classification)
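
An inspection finding from this exercise could be recorded along the lines of the Python sketch below. The severity and class values mirror the form (Major, Minor, Open issue; Missing, Wrong, Extra). The names Finding and check_cost are hypothetical, and the severity assigned in the example is only one inspector's possible judgment, not an answer key from the exercise.

```python
from dataclasses import dataclass

SEVERITIES = {"major", "minor", "open issue"}
DEFECT_CLASSES = {"missing", "wrong", "extra"}


@dataclass
class Finding:
    """One defect logged against a numbered subsystem requirement."""
    requirement: int
    severity: str
    defect_class: str
    note: str = ""

    def __post_init__(self):
        # Reject values that are not on the inspection form.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.defect_class not in DEFECT_CLASSES:
            raise ValueError(f"unknown defect class: {self.defect_class}")


def check_cost(requirement, subsystem_cost, system_limit):
    """Return a Finding if a subsystem cost breaks the system cost limit."""
    if subsystem_cost > system_limit:
        note = f"{subsystem_cost} bird bucks exceeds limit of {system_limit}"
        return Finding(requirement, "major", "wrong", note)
    return None


# Subsystem requirement 24 (300 bird bucks) against
# system requirement 16 (not to exceed 80 bird bucks).
finding = check_cost(24, subsystem_cost=300, system_limit=80)
print(finding)
```

Tracing each subsystem requirement back to a numbered system requirement in this way is what the Traceability and Consistency questions in the checklist ask the inspector to do by hand.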



















Chapter 8 

Software Design Improvements 

Part I: Software Benefits and Limitations


Introduction 

Computer hardware and associated software have been used 
for many years to process accounting information, to analyze 
test data, and to perform engineering analysis. Now, computers 
and software control everything from automobiles to washing 
machines and the number and type of applications are growing 
at an exponential rate. The size of individual programs has 
had similar growth. Furthermore, software and hardware are 
used to monitor and/or control potentially dangerous products 
and safety-critical systems. These uses include everything from 
airplanes and braking systems to medical devices and nuclear 
plants. 

The benefits to systems of using software are reduction in 
weight, better optimization, autonomous action taken in emer- 
gencies, more features and, hence, flexibility for users of 
computer-based products, increased capabilities, better design 
analysis, and identification of the causes of problems. 

What is the benefit of weight reduction? Using a computer 
system to control aircraft and spacecraft has tremendous weight 
and cost advantages over relying upon conventional electrome- 
chanical systems and personnel (who could be better used 
elsewhere). 

Some of the questions software designers ask are: How can
this hardware and software be made more reliable? How can 
software quality be improved? What methodology needs to be 
provided on large and small software products to improve the 
design? How can software be verified? 

Software reliability. Software reliability includes the probability
that the program (in terms of the computer and its
software) being executed will not deliver erroneous output.
People have come to trust computer-generated results (assuming
that the input data are correct); however, we are now
beginning to encounter problems. Recently a manufacturer
reported that its motherboards, which employed a particular
IDE (integrated drive electronics) controller, “when using
certain operating systems have the potential for data corruption
that could manifest itself as a misspelled word in a document,
incorrect values or account balances in accounting software, ...
or even corruption of an entire partition or drive.” The potential
for data errors due to software embedded in certain Pentium
computer chips has also been discovered (ref. 8-1).

Importance of reliability. The tremendous growth in the
use of software to control systems has also drawn attention to 
the importance of reliability. Critical life-support systems and 
flight controls on military and civilian aircraft use software. For 
example, mechanical interlocks, which prevent unsafe condi- 
tions from occurring (such as disabling power when an instru- 
ment cover is removed), are being replaced with software- 
controlled interlocks. 

The size of the software also continues to grow, making it
more costly to find and fix errors. Programs have grown from a
few lines of code 20 years ago to 500 000 source lines of code
(SLOC) for only the flight software of the space shuttle
(ref. 8-2) and 1.588 million SLOC for the F-22 fighter
(ref. 8-3). The application of software in the automotive
industry has increased from an 8-bit processor that controlled
engine applications to a powerful personal computer that added
more built-in diagnostics and systems controls. Also, because of
software complexity, only 1 percent of major software projects
are finished on time and within budget, and 25 percent are never
finished at all (ref. 8-4).

Some problems have become apparent. There occasionally
exists a lack of discipline in generating software; people treat
software controls very lightly and often have not attempted to
predict the reliability and safety implications of their software.
Hence, there are many potential and unrecognized pitfalls in the
application of software that are only now being realized. Many
serious incidents in safety-critical applications may have been
related to software and the complex control interfaces that often
accompany software-controlled systems. One example occurred
when “in 1983 a United Airlines Boeing 767 went into a
4-minute powerless glide after the pilot was compelled to shut
down both engines. This was due to a computerized engine-control
system (in an attempt to optimize fuel efficiency) that
ordered the engines to run at a speed where ice buildup and
overheating occurred” (ref. 8-5).

A China Airlines A300-600R Airbus crashed in part because 
of cockpit confusion. “Essentially, the crew had to choose 
between allowing the aircraft to be governed by its automatic 
pilot or to fly it manually. Instead, they took a halfway measure, 
probably because they failed to realize that their trimmable 
horizontal stabilizer (THS) had moved to a maximum noseup 
deflection as an automatic response to a go-around command.
It was defeating their effort to bring the aircraft’s nose down
with elevator control ... (ref. 8-6).”

Because of these problems, we need to ask the following
questions: What computer system errors can occur? What are
the risks to the system from software? Why do accidents
involving software happen, from both the systems engineering
and the software engineering viewpoints? What are some
software reliability (or safety) axioms that can be applied to
software development? How can we be aware of the real risks
and dangers from the application of software to a control and
sensor problem?

Software quality. How can the design of software be
improved? Part II of this chapter, Software Quality and the
Design and Inspection Process, will answer this question. It
will also discuss the following topics: useful software quality
metrics, tools to improve software quality, software specifications,
assessing the quality and reliability of software, specifications
to improve software safety, tools that affect software
reliability and quality, and factors that affect tradeoffs and
costing when software quality is evaluated.

Software safety. Software development is now a key factor
affecting system safety because of the often catastrophic effects
of software errors. Therefore, a system can only be safe if its
software cannot cause the hardware to create an unsafe condition.
Software safety is the effective integration of software
design, development, testing, operation, and maintenance into
the system development process. A safety-critical computer
software component (SCCSC) is one whose errors can result in
a potential hazard, loss of predictability, or loss of system
control. System functions are safety-critical when the software
operations, if not performed, performed out of sequence,
or performed incorrectly, can result in improper control functions
that could directly or indirectly cause or allow a hazardous
condition to exist. How can this software be improved?



Overview: How Do Failures Arise? 

Generally, we can say that all failures come from the design 
or manufacturing process or from the operation of the equip- 
ment (the computer), its associated software, and the system it 
controls (fig. 1). Software is becoming a critical source of
failures because software failures often occur in unexpected ways. Through
a long history of the design process and particularly in the 
design of mechanisms or structures, the type and severity of 
failures have become well known. Hardware failures can often 
be predicted, inspections can be set up to look for potential 
failures, and the manufacturing process can be changed to make 
a mechanical system more reliable. 

Although a small anomaly or error in the design or operation 
of a mechanical system often produces a predictable and 
corresponding failure, software is different. An incorrect bit, a
corrupted line of code, or an error in logic can have disastrous 
consequences. Testing a mechanical system (though not per- 
fect) can be set up to validate all “known” events; on the other 
hand, software with only a few thousand SLOC may contain 
hundreds of decision options with millions of potential out- 
comes that cannot all be tested for or even predicted. Also, 
historically the design and behavior of mechanical systems 
have been well known, so expanding the performance envelope 
of the design led to a new system that was similar to the old one. 
The behavior of the new mechanical system was predictable. 
This does not hold true for software because minor changes in 
a program can lead to major changes in output. 
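
The scale of the testing problem can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that a program contains n independent two-way decisions executed in sequence, so that the number of distinct execution paths is 2^n. The function name path_count is hypothetical, and real programs have loops and dependent branches that this model ignores.

```python
def path_count(binary_decisions: int) -> int:
    """Distinct paths through a sequence of independent two-way decisions."""
    return 2 ** binary_decisions


for n in (10, 20, 30):
    # 10 -> 1024, 20 -> 1048576, 30 -> 1073741824
    print(n, path_count(n))
```

Under this assumption, only 20 such decisions already yield more than a million paths, which is why exhaustive path testing is rarely practical.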

Error types. The types and sources of errors that can occur
in a computer-controlled system are presented in figure 2 and
are described next:

• Hardware failure in the computer: common to all electrical
devices

• Hardware logic errors (in programmable logic controllers (PLC’s)):
mistakes in design or manufacture

• Coding errors: mistakenly written into program or program
became corrupted

• Requirements errors: missing, incomplete, ambiguous, or
contradictory specifications

• Unintended outcome or state: logic errors in the program
code for a given set of inputs

• Corrupted data: partially failed sensors or errors in internal
lookup tables

• User interface problems: several sources (e.g., multiple points
to turn off computer control of a system or keyboard buffers
are too small)

• Faulty software tools (e.g., finite-element structural analysis
code generation programs): errors in logic and outputs

• Execution problems

° Variations in computer architecture from platform to platform
causing software verified on one platform to behave
differently on another platform
° Faulty or difficult-to-use interfaces between computers
or between computers and sensors


Hardware and software failure differences. In comparison
with the methods used to verify the reliability of system
hardware components, those used for software prediction,
inspection, testing, and reliability verification differ greatly. The
reason for the differences is the nonphysical, abstract nature of
software, the failures of which are almost always information
design oversights or programming mistakes and are not caused
by environmental stresses or cumulative damage. Furthermore,
the design rules for mechanical systems are usually well known,
and a vast amount of historical data on similar systems is
available along with mathematical models of wear, fatigue,
electrical stress, and so forth to make life predictions. In
contrast, each software system is often unique. Even with some
code reuse, complexity makes reapplication difficult. Some
features of software and hardware reliability are compared in
table 8-1.


Types of Software 

Software types are classified on the basis of timing and
control, run methodology, and run environment. 

Timing and control. Software risks and their impact on
systems and data can be evaluated based on how the software
interacts with the system, how humans interact with the system
and the software, and whether this activity is carried on in real
time. Factors to evaluate are whether (1) the software controls
a system or just provides information, (2) real-time human
interference and evaluation of output are allowed, (3) the software
output time is critical or nontime critical, and (4) the data
supplied by informational software are critical or noncritical.
These factors are summarized in table 8-2. The reader should
also consult MIL-STD-882C, System Safety Program
Requirements (ref. 8-7), for types of software based on levels
of control and hazard criticality.
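
The factors above can be turned into a simple lookup, sketched here in Python. The row numbers 1 to 6 follow the top-to-bottom order of table 8-2 and are an assumption made for this example, since the table itself does not number its rows. The names Control and table_row are hypothetical.

```python
from enum import Enum


class Control(Enum):
    AUTONOMOUS = "autonomous"
    SEMIAUTONOMOUS = "semiautonomous"
    SHARED = "mixed computer and human"
    NONE = "information only"


def table_row(control: Control, real_time: bool, critical_info: bool) -> int:
    """Map the evaluation factors to a row of table 8-2 (1 = most hazardous)."""
    if control is Control.AUTONOMOUS:
        return 1
    if control is Control.SEMIAUTONOMOUS:
        return 2
    if control is Control.SHARED:
        return 3
    # Informational software: separate real-time from non-real-time use.
    if real_time:
        return 4
    return 5 if critical_info else 6


# A collision avoidance display gives real-time information but no control.
print(table_row(Control.NONE, real_time=True, critical_info=True))
```

A lookup like this only records the classification; judging which level of control a given product actually has remains an engineering decision.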

Run methodology. — Another classification of software is 
based on run methodology and includes these types: 


TABLE 8-1. HARDWARE AND SOFTWARE FAILURE DIFFERENCES

Category: Reliability prediction
  Hardware: Many mathematical models exist for predicting wear, fatigue life, and electronic component life.
  Software: Reliability predictions are nearly impossible due to the nonrandom distribution of errors.

Category: Causes of failures
  Hardware: Wearout, misuse, inadequate design, manufacture or maintenance, or incorrect use can contribute to failures.
  Software: Poor design affects software (the computer system on which the software resides can also fail).

Category: Redundancy
  Hardware: Hardware reliability is usually improved with redundancy.
  Software: Software reliability (except possibly for multiple voting systems) is not improved with redundancy.

Category: Hard or soft failures
  Hardware: Soft failures (some degradation in service before complete failure) often occur due to wear, chemical
    action, electrical degradation, etc.
  Software: Usually no soft failures occur (however, there may be some recovery routines that can take the system to a
    safe state, etc.).

Category: Maintenance
  Hardware: Usually testing and maintenance improve hardware and increase reliability.
  Software: Software reprogramming may introduce new and unpredictable failure modes into the system. Reliability
    may be decreased. Any change to the code should require complete retesting of the software, but this is ...

Category: Reliability prediction methodology
  Hardware: Design theory, a history of previous systems, and load predictions all allow excellent reliability prediction.
  Software: Software reliability is a function of the development process.
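
The redundancy entry in table 8-1 can be made concrete with the standard parallel-reliability formula, R = 1 - (1 - r)^n, which holds when the n redundant units fail independently. The Python sketch below applies it to hardware and contrasts it with identical software copies, which share the same design fault and therefore fail together on the same input. The function names and the reliability value 0.9 are illustrative assumptions.

```python
def parallel_reliability(unit_reliability: float, copies: int) -> float:
    """Reliability of redundant units that fail independently."""
    return 1.0 - (1.0 - unit_reliability) ** copies


def identical_software_reliability(unit_reliability: float, copies: int) -> float:
    """Identical copies share every design fault, so extra copies add nothing."""
    return unit_reliability


print(parallel_reliability(0.9, 2))            # close to 0.99
print(identical_software_reliability(0.9, 2))  # still 0.9
```

Independently developed versions combined by voting are the possible exception that the table mentions.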















TABLE 8-2. CLASSIFICATION OF SOFTWARE BASED ON LEVEL OF HAZARD AND CONTROL

Software control: Autonomous control exercised over hazardous systems.
  Information: Some information may be available but insufficient for real-time interference.
  Human/other control: May be possible but not desirable; often no other independent safety systems
  Real time: Yes
  Examples: Space shuttle main engine and solid rocket booster ignition sequence

Software control: Semiautonomous control exercised over hazardous systems.
  Information: Real-time information is available to allow human/other system interaction and control.
  Human/other control: Possible and desirable under some circumstances; other independent safety systems or
    ability to disengage
  Examples: Aircraft terrain-following system, medication dispensing device, nuclear power plant safety systems,
    automatic go-around mode in aircraft (override)

Software control: Mix of computer and human control over hazardous systems.
  Information: Real-time information is available to allow human interaction and control. Human control of some
    functions.
  Human/other control: Yes, required for some subsystems of operation; other independent safety systems
  Real time: Yes
  Examples: Aircraft fly-by-wire systems of unstable aircraft (example BN-2) where computer translates pilot’s
    control requests into feasible flight surface modifications

Software control: No, but generates information requiring immediate human action.
  Information: Complete real-time information presented to allow human control over hazardous systems.
  Human/other control: Human interaction required to properly control the system; other independent safety systems
  Real time: Yes
  Examples: Aircraft collision avoidance systems, nuclear power plant instrumentation, hospital patient vital signs

Software control: No, but human action based on information.
  Information: Information not presented in real time. Software does provide critical information.
  Human/other control: Human actions and decisions directly influenced by information; other checks
  Real time: No
  Examples: Statistical process control information of machine tools, historical medical information summaries

Software control: No, but human action based on information.
  Information: Information not presented in real time. Software does not provide critical information.
  Human/other control: Human actions and decisions directly influenced by the information
  Real time: No
  Examples: Financial and economic data


• Interactive: a program that is continuously running and
interacting with the operator

• Batch: a single run or process of a program (often acting on
data, such as a finite-element analysis) from which a single
output will occur

• Remote job entry: a software environment in which programs
are submitted or started by others from remote locations
who usually seek a single output

Environment. Software may be classified according to the
environment in which it operates:

• Embedded: computer code written to control a product;
usually resides on a processor that is part of the product;
typical applications include boiler controllers and washing
machine and automobile computer controls

• Applications: a program that analyzes data; often runs as a
batch job on a computer with limited input from the user
once the job is submitted; operates in payroll systems,
finite-element analysis programs, material requirements planning
(MRP) systems (to update sections)

• Support: software tools that may be considered another class
of programs; used to develop, test, and qualify other software
products or to aid in engineering design and development;
typical applications include compilers, assemblers, and
computer-aided software engineering (CASE) tools


Types of Computer System Errors 

The following examples are problems that have been 
observed with the application of software to control processes 
and systems. 

Spaceprobe. Clementine I, which successfully mapped the
Moon’s surface, was to have a close encounter with a near-Earth
asteroid. A hardware or software malfunction on the
spacecraft resulted in a sequencing mode that triggered an
opening of valves for four of the spacecraft’s 12 attitude control
thrusters, allowing all the hydrazine propellant to be used up
(ref. 8-8).

Chemical plant. Programmers did not fully understand the
way a chemical plant operated. The specifications stated that if
an alarm occurred, all process control settings were to be frozen.
The resulting computer system software released a catalyst into
a reactor and began to increase cooling water flow to it. While
the flow was increasing, the system received a low-oil alarm from
an oil sump and froze the flow of cooling water at too slow a rate.
The result was that the reactor overheated and the pressure
release valve vented a quantity of noxious fumes into the
atmosphere (ref. 8-4).
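
The chemical plant incident can be mimicked with a toy model, sketched below in Python. All numbers (target flow of 100, ramp of 10 per step, alarm at step 3) and the function name are hypothetical; the sketch only shows how a literal "freeze everything on any alarm" rule can hold a setting at an unsafe intermediate value.

```python
def cooling_flow_after(steps, alarm_step, freeze_on_alarm, target=100, ramp=10):
    """Ramp cooling flow toward target; optionally freeze when an alarm arrives."""
    flow = 0
    for step in range(steps):
        if freeze_on_alarm and step >= alarm_step:
            break  # specification as written: freeze all settings on any alarm
        flow = min(target, flow + ramp)
    return flow


# An unrelated alarm arrives while the flow is still ramping up.
print(cooling_flow_after(20, alarm_step=3, freeze_on_alarm=True))   # 30
print(cooling_flow_after(20, alarm_step=3, freeze_on_alarm=False))  # 100
```

The requirement was implemented exactly as specified; the defect lay in a specification that did not consider which frozen states are safe.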

Space Shuttle. A mission abort nearly occurred during
the first flight of Endeavour to rendezvous with and repair an
Intelsat satellite. The software routine used to calculate
rendezvous firings failed to converge to a solution due to a mismatch
between the precision of the state-vector tables, which describe
the position and velocity of the Shuttle (ref. 8-2).
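
How a precision mismatch can stop an iteration from converging is illustrated by the Python sketch below. It is not a reconstruction of the Shuttle software: it simply runs Newton's method for the square root of 2 with a convergence tolerance of 1e-12, once keeping the intermediate value in double precision and once rounding it to single precision after every step. The function names and the tolerance are assumptions made for the example.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (double precision) to single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_sqrt2(store_single: bool, tol: float = 1e-12, max_iter: int = 50):
    """Newton iteration for sqrt(2); returns (estimate, converged)."""
    x = 1.0
    for _ in range(max_iter):
        if abs(x * x - 2.0) < tol:
            return x, True
        x = 0.5 * (x + 2.0 / x)
        if store_single:
            x = to_single(x)  # value stored at lower precision than the test
    return x, False


print(solve_sqrt2(store_single=False))  # converges
print(solve_sqrt2(store_single=True))   # never meets the tolerance
```

The single-precision run stalls because no single-precision number near the root satisfies a tolerance that was chosen with double precision in mind.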

Airliner. A laptop computer used by a passenger on a
Boeing 747-400 flying over the Pacific caused the airliner’s
navigation system to behave erratically. When the computer
was brought to the flight deck and turned on, the navigation
displays went crazy (ref. 8-9).


Sources of Errors 

Investigating the sources of problems should take prece- 
dence over finding the errors in the software logic. Anytime an 
analog and/or an electromechanical control system is replaced 
by a computer system, many unique problems can occur. 

Organizational problems. Determining the causes of errors
and eliminating them requires an analysis of the procedures,
organizational arrangements, and methodology that cause problems
with software. Figure 3 gives an overview of the following
organizational problems:

(1) Communication between the software programmer and 
the systems or design engineer: The designer does not know the 
software and the programmer does not know the system with all 
its potential failure modes (they do not have domain-specific 
knowledge). Programmers frequently fail to understand the 
potential for problems if certain actions do not occur in a logical 
sequence. For example, “start heater and add fluids to boiler”
may be “logical” programming sequences, but what if the computer
has a fault after the heater is started, before enough fluid
is added to the boiler? Similarly, design and safety engineers 
frequently lack knowledge about specific software, the way it 
will control the system, and the potential for software problems. 
They treat the computer and its software as a black box with no 
regard for the consequences if the unit fails. Consequently, in 
the past, system safety engineers ignored software or looked at 
it superficially when analyzing systems. 

(2) Documentation standards for software, testing, and verification:
Many problems are caused by failing to document the
software analysis and the procedures for inspection and testing
and by making last-minute fixes without retesting and
reverification. Design and verification tools may not exist.
Formal procedures for software inspection may not exist or the
procedures may be in place but may be essentially ignored by
the software development group. For example, a potential
flight problem was noticed on one experiment scheduled to fly
in space to evaluate the effects of microgravity. To correct it,
the software was changed during a preflight checkout on a
holiday, but the change was not verified. During the mission,
the heaters on a device developed only 25 percent of the needed
power because of the simple software change, causing the loss of
some mission data.

(3) Standardization of software structure: In many organizations,
not requiring adherence to software standards contributes
to many system failures. Trying to be elegant in writing software,
using complex techniques, and neglecting internal comments
and written documentation can seriously affect the quality of
software and decrease its reuse.

[Figure 8-3. Sources of error based on organizational problems. Figure labels: Programmer, Engineer.]

(4) Configuration control management over software 
changes: During software development and maintenance, 
unauthorized or undocumented changes made by a program- 
mer to fix a possible mistake may cause many problems down 
the line. Toward the end of a project, pressure to complete 
the job encourages code changes without proper review or 
documentation. 

(5) Silver bullets: Overreliance on silver bullets to solve a
company’s software problems results in real issues being overlooked.
One of the most difficult problems to deal with is the
unrealistic hope that an advance in software development technology,
a new code-generating tool, or object-oriented super
code will make software generation problems disappear. This
reliance also manifests itself when state-of-the-art techniques
are exclusively relied upon in lieu of using good documentation,
formal requirements, and continuous interface between
software, design, and safety personnel.

(6) Personnel: A greater attempt should be made to keep good
programming talent because turnover results in a loss of
corporate knowledge, reduces the reuse of code, and causes
problems with software maintenance.

(7) Software reuse: Many software programs are started from
scratch even when existing software could be reused (again with
little control over how the code is to be written). Note that
careful reuse of code has saved time and manpower.

Design and requirements problems.— Poor analysis and
flowdown of requirements specifications for an individual 
project can cause errors, delays, and cost overruns: 

(1) Requirements: Poorly defined requirements for a specific
software project can cause a cost overrun and increase the
probability that code logic errors will be introduced. When
real-time systems are developed for new applications or
applications outside the normal areas of the software engineer’s
expertise, additional requirements are needed to implement the
basic system. Frequently discovered while the software
development process is well underway, these requirements are
often inconsistent, incomplete, incomprehensible, contradictory,
and ambiguous.

(2) Additional features: Adding new features to the software
is also a major problem and is based on the perception that long
after programming has started, requirements for new features
can be added with little negative effect. However, such
additions to performance requirements adversely affect system
software because it must be changed for each new requirement.
Because each change risks increasing errors in the function or
logic, design engineers must always ask, “Have the requirements
been analyzed as a complete set?”

(3) Anticipating problems: More attention must be given to 
protecting the software-controlled system from off-nominal 
environments and to anticipating what states the system can 
reach through an unexpected series of events. Too often, the 
emphasis is on fulfilling performance requirements without 
carefully analyzing what can go wrong with the system. 

(4) Software and/or hardware interaction: Problems result 
from a lack of understanding of how the program will actually 
run once a system is operational. Software may not be able to 
process all the sensor data during a clock cycle, or it may be 
unable to deal with changes in physical conditions and 
processors. 

(5) Isolating processes: Adding too many unnecessary soft- 
ware processes on a computer controlling a safety-critical 
system can reduce assurance that critical processes will be 
handled properly (safety critical refers to systems whose fail- 
ures can cause a loss of life, the mission, or the system). 

Other problem areas.— In addition to the associated
hardware, sensors, and interfaces that can also increase the risk
of errors, other problems concern incorrect data, the reliability
of the system, and the production, distribution, and maintenance
of software.

(1) Reliability: The reliability and survivability of the com- 
puter hardware, sensors, and power supplies are often not 
adequately planned for. The central processing unit (CPU), 
memory, or disk drives of a computer can fail, the system can 
lose power, excess heat or voltage spikes can cause unantici- 
pated errors in performance and output, or the system can 
completely shut down. 

(2) System and/or sensor interfaces: The interfaces between 
sensors and other mechanical devices can fail, resulting in 
damage to cables and the failure of power supplies to sensors or 
servocontrollers. Often these events are not adequately
anticipated, and effective solutions are not provided.

(3) Radio frequency noise: The effect of radio frequency 
(RF) noise is often unanticipated. It can cause a computer 
processor, its memory, and input/output devices to operate 
improperly, or it can cause errors or erroneous readings from 
sensors, poorly shielded cables, connectors, and interface boards 
(e.g., fiber optic to digital conversion). 

(4) Manufacture and maintenance: Improper handling of the
manufacture, reproduction, and distribution of software results
in compilation errors and improper revisions of code being
distributed. Integration problems can occur while assembling
the code, linking program modules together, and transferring
files. Poor control over maintenance upgrades of software and
firmware also causes errors from improperly loading programs,
using the wrong batch files, and patching to the wrong revision
of software.

TABLE 8-3.— SOURCES OF ERRORS BY PERCENT

  Error source                    Percent
  Logic                           [illegible]
  Input/output                    14.74
  Data handling                   [illegible]
  Computation                     [illegible]
  Preset data base                [illegible]
  Documentation                   [illegible]
  User interface                  7.70
  Routine-to-routine interface    5.62

A Rome Laboratories study classified errors by percentage of 
occurrence (table 8-3), which reveals the importance of inter- 
face design and documentation (ref. 8-10). 


Tools to Improve Software System Reliability and Safety 

For each of the aforementioned problem-causing agents, the 
following tools minimize risk and may even eliminate the 
problem. 

Organizational improvement.— Various tools and techniques
properly applied and supported at all organizational levels can 
greatly improve software reliability and safety. 

(1) Communication: Improve communication between
designers, software engineers, and safety engineers through
concurrent engineering, safety review teams, and joint training.
Concurrent engineering with regular meetings between design
and software engineers to review specifications and
requirements will improve communications. Continuous discussions
with the end users will help the engineers to understand the
background of the various system performance requirements. Joint
training and cross training will encourage them to develop
informal relationships and communication. Software safety review
committees consisting of design, software, and safety personnel
who meet continually to review software specifications and
implementation will help assure that safety-critical software
performs properly and that specifications are carefully written,
not just in legal terms but with clear descriptions of how the
system should work.

(2) Documentation: Improve software documentation
standards, testing, and verification procedures. Encourage the
application of standards for all software projects, including
general requirements for all system development projects, the
industry or military standards to be followed, and the
documents to be generated for a specific product. These documents
may include a software version description document (see
part II for more details) and plans for software management,
assurance, configuration management, requirements
specifications, and testing.

(3) Standardization: Set and enforce software structure
standards to delineate what is and what is not allowed. The
programmer should not design a “clever” program that cannot be
readily understood or debugged. Enforce safe subsets of
programming languages, coding standards, and style guides.

(4) Configuration management: Implement consistent con- 
trols over software changes and the change approval process by 
using software development products that include software 
configuration management and code generation tools. 
Computer-aided-software-engineering (CASE) tools and other 
configuration management techniques can automatically com- 
pare software revisions with previous copies and limit unap- 
proved changes. Other programming tools provide mission 
simulation and module interface documentation. 
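
The revision comparison described in item (4) can be sketched in a few lines. The Python fragment below is only an illustration, not a description of any particular CASE tool: it uses the standard `difflib` module to list the lines that differ between a baseline and a new revision and refuses the revision unless the change has been approved. The function names, the sample file contents, and the `approved` flag are assumptions made for the example.

```python
import difflib


def changed_lines(baseline: str, revision: str) -> list:
    """Return the added ('+') and removed ('-') lines between two revisions."""
    diff = list(difflib.unified_diff(
        baseline.splitlines(), revision.splitlines(),
        fromfile="baseline", tofile="revision", lineterm="",
    ))
    # The first two entries are the '---'/'+++' file headers; skip them.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


def accept_revision(baseline: str, revision: str, approved: bool) -> bool:
    """Accept a revision only if it is unchanged or the change was approved."""
    return not changed_lines(baseline, revision) or approved


# Hypothetical source file contents, for illustration only.
baseline = "heater_power = 100\nlog_data = True\n"
revision = "heater_power = 25\nlog_data = True\n"
```

In this sketch an unapproved edit to `heater_power` is reported line by line and rejected, which is the kind of limit on unapproved changes the paragraph describes.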

(5) Silver bullets: The introduction of major changes in the 
procedures for generating software must be scrupulously 
reviewed and their impact on the software personnel, mainte- 
nance, and standardization evaluated carefully. Major disrup- 
tions to personnel can result from any major change in the way 
a product is designed and developed; therefore, careful and 
complete training of personnel, a free flow of information about 
the new system, assurances as to the support of existing 
programmers, and the gradual introduction of the new methods 
(e.g., starting on one small project) are required. Projects 
already underway and those scheduled to begin may or may not 
benefit from the changes. 

(6) Personnel: Provide incentives to keep good program- 
ming talent and maintain the corporate knowledge base. The 
programmers should have a mix of programming skills and 
experience and the ability to transmit practical programming 
knowledge to new programmers who only have classroom 
training with little or no insight into real-world problems. 
Keeping senior programmers or senior managers who can 
review software and participate in independent verification and 
validation (IV&V) of software across missions or products is 
also beneficial, as is retaining workers who know the software 
systems that support software maintenance and new applica- 
tions of the code. Provide training in the proper methodologies. 

Software should be modularized to facilitate changes and 
maintenance. The modules should have low coupling (the 
number of links between modules is minimized) and have high 
cohesion (the level of self-containment). 

Use a “clean room approach” to develop software. This
approach implies a highly structured programming environment
with tight control of the software and system specifications
and with support for, and adherence to, the software analysis
specifications.

(7) Software reuse: Encourage the reuse of software with
strict controls imposed over software structure and procedures
for code reuse. Software modules and software reuse also
improve reliability because of the benefits derived from faults
removed in prior usage. Modularized software with
well-documented and verifiable inputs and outputs also enhances
maintainability. Lewis Research Center’s launch vehicle
programs are reused for each mission with only minor
modifications, and excellent reliability results have been achieved.

Design and requirements improvements.— The hardware
and the software must be integrated to work together. This 
integration includes the entire system with input sensors and 
signal conditioners, analog-to-digital (A/D) boards, the com- 
puter hardware and software itself, and the output devices 
(control actuators). Basic design methodology can improve 
software as well; thus, the following approaches support this 
concept: 

(1) Requirements: Spend sufficient time defining and
understanding requirements. The system, software, and safety
engineers should work with the end user to develop requirements,
to express the requirements in mutually understandable
language, and to design requirements that are testable and
verifiable.

(2) Additional features: Limit changes in requirements once 
the software design process begins. Question whether an addi- 
tional feature is really necessary or if, instead, functionality 
should be reduced to achieve safety and basic performance goals. 
A large number of ancillary noncritical devices and special 
graphical user interfaces may not be necessary and may only 
complicate and slow the system. 

Avoid developing a false sense of security by putting
software in its proper place of importance. Many people
erroneously think that a computer controlling a system can never
fail and will believe computer-controlled readouts rather than
rely on their own good senses.

(3) Anticipating problems: Fully analyze the ways the
software-controlled system can fail and the undesirable states
the system can attain. Then, implement procedures and methods
to ensure that these undesirable states and failure modes cannot
be reached, even through some unusual (though not impossible)
combination of software states, environment, and input data.
Such steps protect the system against these failures.

Develop software with error detection, correction, and recovery
to achieve fault tolerance. Examples of common
errors include inconsistent data in data bases, process deadlock, 
starvation and premature termination, runtime failures due to 
out-of-range values, attempts to divide by zero, and lack of 
storage for dynamically allocated objects. Although software 
does not degrade, it is virtually impossible to prove the correct- 
ness of large, complex, real-time systems. The selective use of 
logic engines can be effective in reducing uncertainty about a 
system’s performance. 

Use software that can detect and properly handle runtime
errors, and software controls that assume the worst and prepare
for it by identifying the undesirable states the computer can
attain and the ways each of these states can be prevented. Make
a careful analysis of responses to failed or suspect sensors.

Software capable of real-time diagnosis of its own hardware 
and sensors is very useful. Memory can be protected with 
parity, error-correcting code, and read-only circuitry in mem- 
ory. Messages received should be checked for accuracy, and 
routes can be automatically changed when errors are detected. 
Predefined system exceptions and user-defined fault excep- 
tions should be designed into application software. Predefined 
exceptions can be raised by runtime systems so the software 
should also have built-in or operating system recovery proce- 
dures. Information for recovery includes processor identifica- 
tion, process name, data reference, memory location, error 
type, and time of detection. 
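
To make the preceding paragraphs concrete, the following Python sketch shows one way a predefined runtime exception (division by zero) and an out-of-range result might be converted into a user-defined fault exception that carries the recovery information listed above. It is a minimal illustration under stated assumptions: the class and function names, the identifiers such as `cpu-0` and `flow_sensor_1`, the memory address, and the valid range of 0.0 to 2.0 are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RecoveryInfo:
    """The recovery fields named in the text."""
    processor_id: str
    process_name: str
    data_reference: str
    memory_location: int
    error_type: str
    detected_at: float


class ControlFault(Exception):
    """User-defined fault exception that carries a RecoveryInfo record."""

    def __init__(self, info: RecoveryInfo):
        super().__init__(f"{info.error_type} in {info.process_name}")
        self.info = info


def flow_ratio(flow: float, demand: float) -> float:
    """Compute flow/demand, converting runtime errors into ControlFault."""
    def fault(error_type: str) -> ControlFault:
        return ControlFault(RecoveryInfo(
            processor_id="cpu-0",            # hypothetical identifiers
            process_name="flow_ratio",
            data_reference="flow_sensor_1",
            memory_location=0x4000,
            error_type=error_type,
            detected_at=time.time(),
        ))

    try:
        ratio = flow / demand                # predefined exception on zero
    except ZeroDivisionError:
        raise fault("divide_by_zero") from None
    if not 0.0 <= ratio <= 2.0:              # assumed valid range
        raise fault("out_of_range")
    return ratio


def controlled_ratio(flow: float, demand: float,
                     last_good: float) -> Tuple[float, Optional[RecoveryInfo]]:
    """Recovery procedure: keep the last good value and return the record."""
    try:
        return flow_ratio(flow, demand), None
    except ControlFault as exc:
        return last_good, exc.info
```

Falling back to the last good value is only one possible recovery policy; a real system would choose the response from its own hazard analysis.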

(4) Software and/or hardware interaction: Computer timing 
problems and buffer overload problems must be eliminated. If 
all alarms and sensors cannot be read in one clock cycle of the 
CPU, errors may occur or alarms may be missed. Overloaded 
buffers can result in CPU lockup. 

Load balancing should be a part of the operating system
software routines because failures are often caused by
overloading one or more processors in the system. Examples of
overloading include an increase in message traffic and the
inability of a processor to perform within time constraints.

In these cases, a potential tool to support complex systems is 
dynamic traffic time sharing in which message streams are 
distributed among identical processors with a traffic coordina- 
tor keeping track of the relative load among processors. 
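
The traffic coordinator idea can be sketched as follows. This Python fragment is an assumption-laden illustration rather than an operating system routine: the class name, the processor names, and the use of a simple integer cost per message are all hypothetical. Each message goes to whichever identical processor currently carries the smallest load.

```python
class TrafficCoordinator:
    """Distribute messages among identical processors by relative load."""

    def __init__(self, processor_ids):
        # Current outstanding load per processor, in arbitrary cost units.
        self.load = {pid: 0 for pid in processor_ids}

    def dispatch(self, message_cost: int = 1) -> str:
        """Send the next message to the least-loaded processor."""
        target = min(self.load, key=self.load.get)
        self.load[target] += message_cost
        return target

    def complete(self, processor_id: str, message_cost: int = 1) -> None:
        """Record that a processor has finished a message."""
        self.load[processor_id] = max(0, self.load[processor_id] - message_cost)
```

When loads are equal, `min` returns the first processor in insertion order, so a burst of equal-cost messages is spread across the processors in turn.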

(5) Isolating processes: Systems for safety-critical
applications need to be separate from everything else. System
specifications often require gathering data from hundreds of
sensors and performing all sorts of noncritical tasks.
Segregating these noncritical tasks in a separate computer system
will often improve the chances that safety-critical functions
will not be disrupted by defects in noncritical resources.
Safety-critical modules should be firewalled, and proven
hardware and technology should be used for critical systems.
Where critical applications are involved, “flight-proven” older
computer systems and software that do the job should be chosen
over newer computers whose standards are rapidly evolving.

Analog interlocks on safety-critical systems should be
replaced with software interlocks only with the greatest of care.
A thorough, well-documented analysis of what would happen
with a computer failure and with a failure of the system that
the interlock protects should also be made. An example of the
problem of replacing mechanical interlocks with software
interlocks involves a radiation therapy machine. An early
model of the therapy machine had a hardware interlock to
prevent radiation overdoses. When the interlock was removed
on a later model and replaced with software logic, several
people were killed by radiation overdoses. The problem was
caused by the operator interface, poorly documented data input
procedures, and inadequate safety procedures. The earlier
model never experienced the problem because the program did
not control the interlock (ref. 8-11).

In many cases, safety-critical systems can have an analog 
process (or a stand-alone computer) capable of taking over if 
the primary computer fails. If a computer control fails on a 
process plant, an analog backup system (which is presumably 
controlled by the computer) could keep the process running 
(though at less than optimum conditions). Alternatively, con- 
trol actuators could go to a safe position if a failure occurred. 
Usually, the process must be allowed to proceed to some 
nominal conditions (e.g., partial cooling water or partial prod- 
uct inflow into a process) before shutting down. 

Monitor the health of the backup systems and the output of 
software control commands independently of the main control 
computer. A separate computer should be performing health 
checks on the main computer and on safety-critical sensor 
outputs. 
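
A minimal sketch of such an independent health check is given below in Python. It assumes, purely for illustration, that the main computer sends a periodic heartbeat and that the separate monitoring computer declares it unhealthy when no heartbeat has arrived within a timeout; the class name, the timeout value, and the explicit `now` timestamps are assumptions, not part of the source.

```python
from typing import Optional


class HeartbeatMonitor:
    """Runs on a separate computer and watches the main computer."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_beat: Optional[float] = None   # time of last heartbeat

    def beat(self, now: float) -> None:
        """Called whenever a heartbeat arrives from the main computer."""
        self.last_beat = now

    def healthy(self, now: float) -> bool:
        """True only if a heartbeat was seen within the timeout window."""
        if self.last_beat is None:
            return False
        return (now - self.last_beat) <= self.timeout_s
```

Passing the time in explicitly keeps the sketch easy to test; a deployed monitor would read its own clock, which must be independent of the computer being watched.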

Conduct special tests to verify the performance of safety- 
critical software. This testing should verify that the software 
responds correctly and safely to single and multiple failures or 
alarms; that it properly handles operator input or sensor errors 
(e.g., data from a failed sensor); that it does not perform any 
unintended routines; that it detects failures and takes action 
with respect to entry into and execution of safety-critical 
software components; and that it is able to receive alarms or 
other inhibit commands. 

Formal methods can use abstract models and specification 
languages to develop correct requirements. Logic engines can 
be used to prove the correctness of the requirements. 

For many years, the Lewis Research Center’s launch vehicle 
program verified the software for each mission by running the 
complete program in the mission simulation lab. All the mis- 
sion constants and components were checked and verified. 
Lewis never lost a vehicle because of software problems. 

Other improvements.— The hardware/software system must
also be integrated with input sensors and signal conditioners
(e.g., analog-to-digital boards) and the output devices (e.g.,
servocontrolled actuators). Because the reliability of all this
hardware is also an issue, some basic approaches to total system
performance follow:

(1) Reliability: The reliability and survivability of the elec- 
tronic components associated with the software control system 
can be improved by properly protecting components from 
vibration, excess heat and voltage, and current spikes. Properly 
maintained grounding and shielding also must be assured with 
maintenance training and documentation. Robust sensors, 
actuators, and interfaces also contribute to a more reliable 
system. Sensor failure can cause the wrong data to be proc- 
essed. Even the fraying of cables has been linked with possible 
uncontrolled changes in aircraft flight surface actuation. The 
reliability of computer-controlled output devices
(servoactuators, valves, relays) must also be verified. Because output
devices may be subject to noise problems, error recovery and 
restart procedures should be included in software and properly 
tested. 

Passive controls should be designed so that failures cause the 
system to go to a safe state. If input commands or sensor 
readings are suspect, the system should go to a safe condition, 
which is accomplished by an analog backup or an autonomous 
software module that should be in a separate backup system. 

Multiple voting systems (multiple computers running the 
same task in parallel with independently written programs) might
help to improve reliability. Although this concept is beneficial 
in theory, some studies suggest that common software logic 
faults arise from common requirements. Furthermore, mainte- 
nance and configuration management of this type of system is 
greatly complicated by having different active versions of code 
(ref. 8-12). Multiple computers with software written for the 
same functional output but developed independently is one way 
to handle the critical problem of software taking the operator to 
a condition that was never intended. Systems should sense the 
occurrence of anomalies and alert the operator. Health monitor- 
ing of the controlled system and the computer itself, including 
frequent self-checks, should be part of the program. 
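
The voting step of such a system can be sketched as a strict-majority vote over the outputs of the redundant channels. The Python function below is an illustration only; its name and its return convention are assumptions. Note that, as the paragraph above cautions, a voter cannot catch a fault that all versions share because it came from a common requirement.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, True) for a strict majority, else (None, False).

    `outputs` holds the results of independently written programs
    running the same task on separate computers.
    """
    if not outputs:
        return None, False
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value, True
    return None, False      # disagreement: alert the operator
```

A result of `(None, False)` is the anomaly case in which the system should alert the operator rather than act on any single channel.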

Redundant systems need to have separate power sources and 
locations (to avoid common mode failures). Use uninterruptible
power supplies for critical software systems. Have battery 
backup for as long as needed to switch to manual operation. 
Avoid a common power supply that can send a surge to all 
devices at once or can shut off all devices at once. 

A distributed system can also be used to improve reliability. 
The system can sense problems in one processor and transfer its 
work to another processor or system. Hardware components 
degrade with time and represent the most important factor in 
ensuring reliability of real-time systems. However, note that 
the complexities of a distributed system can cause new prob- 
lems that possibly reduce reliability. For example, the synchro- 
nization and precision of numerical values between programs 
and communications procedures can cause errors. More re- 
sources are also consumed for coding and testing and programs 
become larger (with more chance for error). 

(2) System and/or sensor interfaces: The computer and sensor
interfaces must be thoroughly tested to prevent mechanical
failures, intermittent contacts, connector problems, and noise.
Again, provisions for data out of acceptable ranges must be
made.

(3) Radio frequency noise: Radio frequency (RF) noise 
problems can be avoided. Input and output data should be 
validated before use. The software should check for data out- 
side valid ranges and take appropriate action such as setting off 
an alarm or shutting down the system. Proper maintenance 
procedures and training in the removal and replacement of 
grounding and shielding should be developed. The interaction
of and possible need for separate analog and digital grounds
should also be investigated. Thorough system testing in all
anticipated environments should be performed.
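
The range check described in item (3) might look like the Python sketch below. The policy shown is an assumption chosen for illustration and is not taken from the source: a single out-of-range reading raises an alarm, and a run of consecutive bad readings (three by default) requests a shutdown, so that an isolated noise glitch does not by itself stop the system.

```python
class RangeChecker:
    """Validate readings before use and escalate on repeated bad data."""

    def __init__(self, low: float, high: float, shutdown_after: int = 3):
        self.low = low
        self.high = high
        self.shutdown_after = shutdown_after
        self.bad_count = 0          # consecutive out-of-range readings

    def check(self, value: float) -> str:
        """Return 'use', 'alarm', or 'shut_down' for one reading."""
        if self.low <= value <= self.high:   # NaN fails this test too
            self.bad_count = 0
            return "use"
        self.bad_count += 1
        if self.bad_count >= self.shutdown_after:
            return "shut_down"
        return "alarm"
```

The limits and the escalation count would come from the system’s own requirements and testing in its anticipated environments.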

(4) Manufacturing and maintenance: The duplication, load- 
ing, and maintenance of software must be planned and con- 
trolled. Procedures must be developed to assure that the proper 
code is loaded on each processor model. All new compilations 
of code must be verified. Buggy compilers can introduce defects. 
Subtle changes from one revision of an operating system to 
another can cause a difference in response to the same code. 
Procedures and requirements for maintenance upgrades must 
also be developed. The updated software should be adequately 
tested and verified (to the same level and extent and to the same 
requirements as the operating software) for accuracy (perfor- 
mance), reliability, and maintainability. New software should 
be modularized and uploaded as individual modules when 
maintenance is being performed. Also, whenever possible, issue 
firmware changes as fully populated and tested circuit cards 
(not as individual chips). 
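
One common way to check that the proper code is loaded on each processor model is to compare a digest of the image against the digest recorded at release. The Python sketch below uses SHA-256 from the standard `hashlib` module; the manifest layout, the model names, and the sample images are assumptions made for the example and do not describe any specific loading procedure.

```python
import hashlib


def image_digest(image: bytes) -> str:
    """SHA-256 digest of a software image, as a hexadecimal string."""
    return hashlib.sha256(image).hexdigest()


def verify_load(image: bytes, processor_model: str, manifest: dict) -> bool:
    """True only if the image matches the digest released for this model."""
    expected = manifest.get(processor_model)
    return expected is not None and image_digest(image) == expected


# Hypothetical release manifest: processor model -> approved digest.
released_image = b"flight code rev C"
manifest = {"model-A": image_digest(released_image)}
```

A load is refused both when the image differs from the released revision and when no release exists for the processor model being loaded.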


Software Development Tools 

Several methods can be used to analyze and verify software. 

Fault tree analysis.— This can identify critical faults and
potential faults or problems. Then, all the conditions that can
lead to these faults are considered and diagrammed.
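
A fault tree of this kind can be evaluated mechanically once it is written down. The Python sketch below assumes a simple representation that is not from the source: a node is either the name of a basic event or a tuple of a gate ("AND" or "OR") and a list of child nodes. The example tree and its event names are hypothetical.

```python
def evaluate(node, basic_events):
    """Return True if the fault at `node` occurs for the given basic events.

    A node is a basic event name (str) or a tuple (gate, [children]),
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return basic_events[node]
    gate, children = node
    results = [evaluate(child, basic_events) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {gate}")


# Hypothetical tree: the heater is underpowered if the software commands
# low power, OR if both the primary and the backup supplies fail.
tree = ("OR", ["bad_software_command",
               ("AND", ["primary_supply_fails", "backup_supply_fails"])])
```

In this example the software command appears directly under the top OR gate, which marks it as a single-point cause of the top-level fault.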

Petri net analysis.— This provides a way to model systems
graphically. A Petri net has a set of symbols that show inputs,
outputs, and states with nodes that are either “places”
(represented by circles) or “transitions” (represented by
vertical lines). When all the places with connections to a
transition are marked, the transition is “fired” by removing a
mark from each input place and adding a mark to each place
pointed to by the transition (the output places) (ref. 8-4).
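
The firing rule just described is short enough to state in code. In the Python sketch below, a marking is a dictionary from place name to the number of marks it holds; this representation and the place names are assumptions for illustration.

```python
def enabled(marking, inputs):
    """A transition is enabled when every input place holds a mark."""
    return all(marking.get(place, 0) >= 1 for place in inputs)


def fire(marking, inputs, outputs):
    """Fire a transition and return the new marking.

    One mark is removed from each input place and one mark is added
    to each output place. Raises ValueError if not enabled.
    """
    if not enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)          # leave the old marking untouched
    for place in inputs:
        new_marking[place] -= 1
    for place in outputs:
        new_marking[place] = new_marking.get(place, 0) + 1
    return new_marking


# Hypothetical net: processing starts when data are ready and the CPU is idle.
m0 = {"sensor_ready": 1, "cpu_idle": 1, "processing": 0}
m1 = fire(m0, ["sensor_ready", "cpu_idle"], ["processing"])
```

After the firing, `m1` holds a single mark in `processing`, and the same transition is no longer enabled.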

Hazard analysis.— This uses formal methods to identify
hazards and evaluate software systems (ref. 8-11).

Formal logic analyzers.— These are logic engines that can
verify specifications. Some source analyzers can reveal logic 
problems in code and branching problems. 

Pseudocodes.— These are used for program design and
verification. They are similar to programming languages but are
not compiled. They have the flow and naming notation of a
programming language but have a readable style that allows
someone to better understand program logic (ref. 8-4).

State transition diagrams (STD’s).— These are graphs that
show the possible states of the system as nodes and the possible
changes that may take place as lines. They can highlight poor
architecture or unnecessarily complex computer code (ref. 8-4).
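
A state transition diagram maps directly onto a lookup table, which also makes missing lines easy to find. The Python sketch below is illustrative only: the states, the events, and the choice to send any unhandled (state, event) pair to a `safe_shutdown` state are assumptions, not taken from the source.

```python
# Each entry is one line of the diagram: (state, event) -> next state.
TRANSITIONS = {
    ("idle", "start"): "heating",
    ("heating", "at_temperature"): "holding",
    ("heating", "fault"): "safe_shutdown",
    ("holding", "stop"): "idle",
    ("holding", "fault"): "safe_shutdown",
}


def next_state(state, event):
    """Follow one line of the diagram; unhandled pairs go to a safe state."""
    return TRANSITIONS.get((state, event), "safe_shutdown")


def unhandled_pairs(states, events):
    """List every (state, event) pair with no explicit transition."""
    return [(s, e) for s in states for e in events
            if (s, e) not in TRANSITIONS]
```

Calling `unhandled_pairs` over all states and events gives a reviewer the list of combinations that the design has not explicitly considered.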

Software failure mode effects analysis (FMEA).— This
analyzes what can go wrong with the software and with the
system itself. The FMEA should analyze whether the system is
fault tolerant with respect to hardware failures and make certain
that the system specifications are complete. The actual failure
of the computer hardware usually results in a hard failure and
the effects are easily identified. However, the effects of failures
handled by software may not be so clear. For example, how
does the software handle the loss of one piece of sensor data or
a recovery from a fault?


Software Safety Axioms and Suggestions 

These axioms should be read and reread and the principles 
behind them thoroughly understood. 

(1) Persons who design software should not write the code
and those who write the code should not do the testing. 

(2) Accidents are caused by incomplete or wrong assump- 
tions about the system or process being controlled. Actual 
coding errors are less frequent perpetrators of accidents. 

(3) Unhandled controlled system states and environmental 
conditions are a big cause of “software malfunctions.” 

(4) The lack of up-to-date professional standards in soft- 
ware engineering and/or the lack of the use of these standards 
is a root cause of many problems. 

(5) Changes to the original system specifications should be 
limited. 

(6) It is impossible to build a complex software system to 
behave exactly as it should under all conditions. 

(7) Software safety, quality, and reliability are designed in, 
not tested in. 

(8) Upstream approaches to software safety are most 
effective. 

(9) Software alone is neither safe nor unsafe. 

(10) Many software bugs are timing problems that are diffi- 
cult to test for. 

(11) Software often fails because it goes somewhere that the 
programmer does not think it can get to. 

(12) Software systems do not work well until they have been 
used. 

(13) Mathematical functions implemented by software are 
not continuous functions but have an arbitrary number of 
discontinuities. 

(14) Engineers believe one can design “black box tests” on 
software systems without the knowledge of what is inside 
the box. 

(15) Safety-critical systems should be kept as small and as 
simple as possible; any functions that are not safety critical 
should be moved to other modules. 

(16) A software control system should be treated as a
single-point failure (in the past the software was often ignored).

(17) What must not happen should be decided at the outset 
and then one should make sure that the program cannot get 
there. 

(18) The system should be fault tolerant and able to recover
from faults and instruction jumps.

(19) Independent verification and validation (IV&V) of
software should be used.


Conclusions 

Software is now used in many safety-critical applications and
each system has the potential to be a single-point failure or to
be zero fault tolerant; that is, a single failure will cause the
system to fail or, if a computer is controlling a hazardous
function, a single failure can cause a hazardous condition to
exist.

Potential problems with software are not well understood. 
Computers controlling a system (the computer hardware, the 
software, the sensors, and output devices that direct the flow of 
energy) are not a black box that can be ignored in a safety, 
reliability, or risk evaluation. However, if handled and applied 
properly, software and hardware may be used to control a 
system and thus can be a valuable design option. 

The software development process can be improved by good 
communication, documentation, standardization, and configu- 
ration management. Other major factors in proper software 
development are correct and understandable requirements. 
Factors that help to improve confidence in the system are 
anticipating problems, properly handling errors, and improv- 
ing hardware reliability. Methods to validate and improve 
software quality (and safety) are discussed in part II. 


Part II — Software Quality and the 
Design and Inspection Process 

Software Development Specifications 

Improving software with standards and controls must include
the following:

• Robust design: making software fault tolerant

• Process controls: standardizing the software development
process

• Design standards: standardizing the software specifications

• Inspection: standardizing the software requirements
inspection process

• Code inspection: standardizing the software code
inspection process

Precise and easily readable documentation and specifications
are necessary for a successful software project. Ideally,
formal methods and specification languages should be used, and
once the specifications are written, they must be understood
and adhered to. This process requires team participation in
generating documents and specifications and also real support
from upper management and the team for the specifications, the
documentation, and the verification and validation of software
conformance.

Some of these documents and related practices should include

(1) A formal software management plan that includes the 
software development cycle, the configuration manage- 
ment plan, approval authority, and group charter and 
responsibilities. (This plan would specify what other 
documentation is required, how interfaces are to be 
controlled, and what the quality assurance and verifica- 
tion requirements are.) 

(2) A formal software design specification that includes 
architecture specifications and hardware interfaces 

(3) A software development plan that describes develop- 
ment activities, facilities, personnel, activity flow, and 
the development tools for software generation 

(4) A plan for formal inspection of software that includes 

(a) a software quality assurance plan to integrate hard- 
ware and software safety, quality, and reliability 

(b) a software verification test specification 

(c) a software fault tolerance and failure modes and 
effects analysis specification 

(5) A software safety program plan that includes a software 
safety handbook and reliability practices specifications 

(6) A formal plan for maintenance and operation 

(7) Configuration management and documentation plans 
that specify recording all changes to software and the 
reasons for the changes. (Records should include design 
changes that require software modifications or any 
change in the functional capabilities, performance 
specifications, or allocation of software to components 
or interfaces.) 

(8) Interface control documentation that specifies linking 
hardware and software, vendor-supplied software, and 
internally generated software 

(9) Failure review boards to review bugs, the bug removal 
process, and the overall effect of bugs on the system 

(10) Lessons learned to be used to document problems and 
the solutions to eliminate repetition of errors 

(11) Test plans that will, to the greatest extent possible, 
validate the software system 

Once these documents are developed and the procedures set 
up, they must be implemented, enforced, and maintained. A 
software system safety working team (multidisciplined) can 
assist software engineering and continually monitor adherence 
to the documentation. The team must also engender respect for
the need to follow the specifications rather than simply mandating
them and walking away. Therefore, the team and software engineering
management must educate programmers in the understanding and use
of specifications (ref. 8-13).

Specifications and Programming Standards 

Structured programming with a well-defined design approach 
and extensive commenting benefits the software design
process. Standardizing formats, nomenclature, language,
compilers, and platforms for the software contributes to project 
success as well. Besides many excellent internal company 
standards for software development, a number of documents 
exist to help in the standardization and to gauge the maturity of 
software development. Some of these documents are 

(1) The Software Engineering Institute (SEI) Capability 
Maturity Model (CMM) is a method for assessing the software 
engineering capabilities of development organizations. It evalu- 
ates the level of process control and methodology in developing 
software and is designed to rank the “maturity” of the company 
and its ability to undertake major software development projects. 

(2) ISO 9000-3 Software Guidelines, Part 3, Guidelines
for the application of ISO 9001 to the development, supply, and
maintenance of software, is intended to provide suggested
controls and methods.

(3) The IEEE Software Engineering Standards Collection
includes 22 standards (1993 edition) covering terminology,
quality assurance plans, configuration management, test
documentation, requirements specifications, maintenance,
metrics, and other subjects.

(4) NASA-developed software standards include NSS
1740.13, INTERIM, June 1994, NASA Software Safety Standard,
which expands on the requirements of NASA Management
Instruction (NMI) 2410.10, NASA Software Management
Assurance and Engineering Policy. These documents contain a
detailed reference document list.

(5) DOD standards include MIL-STD-882C, System Safety
Program Requirements (ref. 8-7); DOD-STD-2167A, Defense
System Software Development (MIL-STD-498) (ref. 8-14);
software development and documentation (e.g., ref. 8-15); and
numerous other standards and guidelines (for reference only).


NASA Software Inspection Activities 

We now want to focus on one area of the software docu- 
mentation, testing, inspection, and qualification process: the 
software inspection activity. This inspection process includes 
(1) metrics, (2) software inspection training, and (3) formal 
software inspection. Inspection activities include 

• Implementation of requirements 

• Review of pseudocode 

• Review of mechanics 

• Review of data structure 

• “Walkthrough” of code 

• Verification and validation 

• Independent verification and validation 

The objectives of formal inspection include (1) removing 
defects as early as possible in the development process, (2) 
having a structured, well-defined review process for finding 
and fixing defects, (3) generating metrics and checklists used to 
improve quality, (4) following total quality management (TQM) 
techniques such as working together as a team, and (5) taking 
responsibility for a work product shared by the author’s peers. 

To achieve these objectives, specifications must be review- 
able, formally analyzable, and usable by the designers and the 
assurance and safety engineers. Furthermore, the specifications 
must support completeness and robustness checks and they 
must support the generation of mission test data. 

Formal design requirements and inspections — The objec- 
tive of inspection is to remove defects at the earliest possible 
point in the product development life cycle. The product can be 
a document, a process, software, or a design. Inspection topics 
include requirements, design requirements, detailed design 
requirements, source code, test plans, procedures, manual 
standards, and plans. 

Inspection is a very structured process which requires that 
team members, who are involved because of their technical 
expertise, be sincerely interested in the software product. 
Rather than being viewed as a fault-finding mission, the inspection
should be considered a tool to help the author identify and 
correct problems as early as possible in the development 
process. The inspection should also help to foster a team 
environment by emphasizing that everyone is involved to 
develop a high-quality product. 

Metrics (minor errors discovered, major errors discovered) 
generated during this process are used to monitor the type of 
software defects discovered and to help prevent their recur- 
rence (refs. 8-16 to 8-18). 

Process overview.— Staff, procedures, development time, 
and training are applied to a developing software product to 
improve its quality. The formal seven-step program for inspec- 
tion includes 

(1) The planning phase: organizing for the inspection 

(2) The training phase: background and details of the inspec- 
tion activity given to team members 

(3) The preparation phase: review of the work by individual 
inspectors prior to the joint inspection meeting 

(4) The inspection meeting: defects identified, classified, and 
recorded by the team 

(5) The “third hour” (cause phase): offline discussions held 
by programmers to get help with defects 

(6) The rework phase (corrective action): defects corrected 
by programmers 

(7) The followup phase: revisions reviewed and verified by 
the team 
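
The seven phases listed above run in a fixed order. As a minimal sketch of that ordering, assuming nothing beyond the list itself, the phases can be modeled as an ordered enumeration. The names `InspectionPhase`, `next_phase`, and `missing_phases` are hypothetical and are not taken from any NASA tool.

```python
from enum import IntEnum


class InspectionPhase(IntEnum):
    """The seven phases of the formal inspection, in order (hypothetical names)."""
    PLANNING = 1
    TRAINING = 2
    PREPARATION = 3
    INSPECTION_MEETING = 4
    THIRD_HOUR = 5
    REWORK = 6
    FOLLOWUP = 7


def next_phase(phase):
    """Return the phase that follows `phase`, or None after the followup phase."""
    if phase is InspectionPhase.FOLLOWUP:
        return None
    return InspectionPhase(phase + 1)


def missing_phases(completed):
    """List, in order, the phases not yet recorded as completed."""
    done = set(completed)
    return [p for p in InspectionPhase if p not in done]
```

A project tracker could call `missing_phases` before closing out an inspection to confirm that no phase was skipped.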

Roles. Each person who participates in the inspection performs
various tasks:

• Moderator: coordinates the inspection process, chairs the
inspection meetings, and ensures that the inspection is
conducted

• Reader: presents the work product to the inspection team
during the meeting (the programmer (author) does not give
the presentation)

• Recorder: documents all the defects, open issues, and action
items brought forward during the meeting

• Inspector: helps to identify and evaluate defects (the responsibility
of every person at the meeting)

Development process benefits.— Some of the benefits of 
formal inspection for the overall software development process 
are that it 

• Improves quality and saves cost through early fault detection 
and correction 

• Provides a technically correct base for the following devel- 
opment phases 

• Contributes to project tracking 

• Improves communication between developers 

• Aids in project education of personnel 

• Provides structure for in-process reviews 

Inspection also benefits the software developer in a number 
of ways: 

• Reduces defects made by the author because they are identified
early in the product life cycle

• Identifies efficiently any omissions in the requirements

• Provides constructive criticism of and guidance to the programmer
by the inspection team in private rather than by
tearing down software in open public project design reviews

• Provides a constructive atmosphere for the entire team because
of lessons learned from others’ mistakes

• Implements improved project tracking with inspection milestones
embedded in the project

• Improves understanding of the overall project and engenders
communication and teamwork by bringing together project
persons from varied backgrounds

• Trains new members of the software development team by
working with senior team members

Figure 8-4 presents the waterfall flowchart of the software
development process (based on phases in MIL-STD-498,
Defense System Software Development, ref. 8-14). The following
acronyms are used:

CDR critical design review 

CSCI computer software configuration item (major computer 
software PROGRAM) 

CSU computer software unit (program module) 

FCA functional configuration audit 
I software inspections 

IV&V independent verification and validation activity 




NASA/TP— 2000-207428 




PCA physical configuration audit 
PDR preliminary design review 
SDR system design review 
SRR system requirements review 
SSR software specification review 
SW computer software 
TRR test readiness review 
V&V verification and validation activity 

Basic rules of inspection.—These basic rules must be followed
if the software inspection process is to be effective. 

(1) Inspections are in-process reviews conducted during 
the development of a product in contrast to milestone reviews 
conducted between development phases. 

(2) Inspections are conducted by a small peer team, each 
member of which has a special interest in the project success. 

(3) Managers are not involved in the inspection and its 
results are not used as a tool to evaluate developers. 

(4) The moderator leads the inspection and must have
received formal training to do so.

(5) Each team member, in addition to being an inspector, is
assigned a specific role.

(6) The inspection is spelled out in detail and no step of the 
process is omitted. 

(7) The overall time of the inspection is preset to aid in 
meeting the schedule. 

(8) Checklists are used to help identify defects. 

(9) Inspection teams should work at an optimal rate, the
object of the meeting being to identify as many defects as
possible, not to cover as many pages as possible.

(10) Inspection metrics are defect type, number, and time
spent on inspections. These metrics are used to improve the
development process and the work product and to monitor the
inspection.
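
The inspection metrics named in rule (10), defect type, number, and time spent, can be captured in a small record. The sketch below is illustrative only: the class `InspectionRecord`, its fields, and the example defect-type labels are assumptions, not part of any NASA inspection form.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class InspectionRecord:
    """Metrics for one inspection meeting (hypothetical structure)."""
    work_product: str
    hours_spent: float
    defects: list = field(default_factory=list)  # (defect_type, severity) pairs

    def log_defect(self, defect_type, severity):
        """Record one defect; severity is 'major' or 'minor'."""
        if severity not in ("major", "minor"):
            raise ValueError("severity must be 'major' or 'minor'")
        self.defects.append((defect_type, severity))

    def summary(self):
        """Return defect counts by severity and type, plus defects per hour."""
        total = len(self.defects)
        rate = total / self.hours_spent if self.hours_spent > 0 else None
        return {
            "total": total,
            "by_severity": dict(Counter(s for _, s in self.defects)),
            "by_type": dict(Counter(t for t, _ in self.defects)),
            "defects_per_hour": rate,
        }
```

Summaries collected over several inspections give the trend data that the rule says should be used to monitor the inspection and improve the development process.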

Results of software inspections.—Formal inspections save 
costs because fixing defects early in the development cycle is 
less costly than removing them later; they train team members 
and provide them with a valuable development tool as lessons 
learned from their participation in the bug identification and 
removal process; and they improve developer and development 
efficiency and lead to higher quality. 

Fixing a defect found through inspections costs on the
average less than 1 hour per defect; fixing a defect found during
software testing typically takes from 5 to 18 hours. Another cost
factor is that defects tend to amplify. One defect in requirements
or design may impact multiple lines of code. For example,
a small study conducted by the Jet Propulsion Laboratory (JPL)
found an amplification rate of 1 to 15, which means that 1 defect
in the requirements impacts 15 source lines of code (SLOC), as
seen in figure 8-5 (information taken from ref. 8-19).
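
The figures quoted above can be combined into a rough comparison. The following sketch uses only the numbers in the text (less than 1 hour per defect by inspection, 5 to 18 hours in testing, and a 1:15 amplification ratio); the function names are hypothetical, and treating 1 hour as the upper bound for inspection is an assumption made for the arithmetic.

```python
AMPLIFICATION_SLOC_PER_DEFECT = 15   # JPL study: 1 requirements defect impacts 15 SLOC
INSPECTION_HOURS_MAX = 1.0           # "less than 1 hour" per defect, taken as an upper bound
TESTING_HOURS_RANGE = (5.0, 18.0)    # typical hours per defect found during testing


def affected_sloc(requirement_defects):
    """Estimate source lines touched by a number of requirements defects."""
    return requirement_defects * AMPLIFICATION_SLOC_PER_DEFECT


def rework_hours(defects):
    """Bound the rework effort for defects caught by inspection versus testing."""
    low, high = TESTING_HOURS_RANGE
    return {
        "inspection_max": defects * INSPECTION_HOURS_MAX,
        "testing_min": defects * low,
        "testing_max": defects * high,
    }
```

For a hypothetical batch of 40 requirements defects, `affected_sloc(40)` gives 600 lines, and `rework_hours(40)` gives at most 40 hours if the defects are caught by inspection against 200 to 720 hours if they are caught in testing.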


Inspections were also used at IBM Federal Systems to 
develop software for the space shuttle. The original defect rate 
of 2.25 defects per thousand lines of code (KLOC) was unac- 
ceptable. Over a 3-year period, inspections were applied on 
requirements, design, code and test plans, specifications, and 
procedures. The goal for this effort was 0.2 defect per KLOC. 
With inspections, the project was able to surpass the goal and 
attain a defect rate of 0.08 defect per KLOC. 
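
To put these defect rates in concrete terms, the sketch below converts a rate per KLOC into an expected defect count and compares two rates. Only the three rates come from the text; the helper names and the 100-KLOC program size used in the note are hypothetical.

```python
ORIGINAL_RATE = 2.25   # defects per KLOC before inspections
GOAL_RATE = 0.2        # target defects per KLOC
ACHIEVED_RATE = 0.08   # defects per KLOC attained with inspections


def expected_defects(defects_per_kloc, kloc):
    """Expected number of defects in a program of `kloc` thousand lines."""
    return defects_per_kloc * kloc


def improvement_factor(before, after):
    """How many times lower the `after` rate is than the `before` rate."""
    return before / after
```

For an illustrative 100-KLOC program, the original rate corresponds to about 225 defects and the achieved rate to about 8. The achieved rate is roughly 28 times lower than the original rate and 2.5 times lower than the goal.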

One of the most essential lessons learned from the initial 
implementation of the inspection process is that all inspection 
participants require some type of training. Everyone needs to 
understand the purpose and focus of inspections and the 
resources required to support the process. Adequate time has to 
be provided for inspections in the software development pro- 
cess. Furthermore, using metrics from inspections provides an 
excellent basis for monitoring both the inspection and develop- 
ment process and for evaluating process improvements. 

Another lesson learned is that a formal inspection requires 
projects to have an established development life cycle, an 
established set of documents produced during the phases of the 
life cycle, programming standards, and software development 
standards (e.g., NASA Software Assurance Standard, NASA-STD-2201-93,
which states that “Software verification and 
validation activities shall be performed during each phase of the 
software life cycle and shall include formal inspections.”). 

Additional benefits of formal inspections to the project are 
that they can be used with any development methodology 
because no matter which development process or life cycle is 
used, products being produced can be inspected; they are 
applied during the development of work products and are a 
complement to milestone or formal reviews but are not intended 
to replace them; they are recommended by the NASA Software 
Assurance Standard and can be applied to the work products 
called out in the NASA Software Documentation Standard 
(refs. 8-20 to 8-22). 

Additional Recommendations 

On the basis of an evaluation of the space shuttle software 
development process, the following recommendations were 
made (ref. 8-13): 

(1) Verification and validation (V&V) inspections by con- 
tractors should pay close attention to off-nominal cases (crew 
and/or ground error, hardware failure, software error condi- 
tions), should focus on verifying the consistency in the levels of 
descriptions for modules with the consistency in module require- 
ments and the design platform, should assure correctness with 
respect to the hardware and software platforms, and should 
maintain the real independence of independent verification and 
validation (IV&V). 

(2) The project should have sufficient personnel trained in
system reliability and quality assurance (SR&QA) to support
software-related activities and provide oversight and evaluation
of software development activities by the individual
SR&QA offices.

Figure 8-5.—Amplification of requirements into source code.
Average amplification ratio, 1:15 (from ref. 8-19).

(3) The same standards and procedures should be provided 
and enforced for multiple centers on the same program. Con- 
sistent software development coding guidelines should be 
provided to contractors. 

(4) Visibility for potential software problems should be 
provided by defining detailed procedures to report software 
reliability, quality assurance (QA), or safety problems to the 
program-level organization. 

(5) Accepted policies and guidelines should be provided for 
the development and implementation of software V&V, 
IV&V, assurance, and safety. These should include a well- 
documented maintenance and upgrade process. 

(6) Sufficient resources, personnel, and expertise should be 
provided to develop the required standards. Also, sufficient 
resources, manpower, and authority should be used to compel 
development contractors to verify that proper procedures were 
followed. 

(7) Lessons learned in the development, maintenance, and 
assurance of software should be recorded for use by other 
programs (refs. 8-23 to 8-26). 

(8) The information that each development and oversight
contractor is responsible for making available to the community
as a whole should be precisely identified. Mechanisms
should be in place to ensure that programs are given all information
needed to make intelligent implementations of software
oversight functions.


Conclusions 

The overall software design process will be improved by 
carefully constructing the initial documentation to generate 
real and usable requirements. Requirements must be capable of 
being verified by inspection and test. 


Software product assurance activities include formal inspec- 
tion, production-quality metrics, software inspection training, 
a code “walkthrough,” verification and validation, and 
independent verification and validation. These activities are 
making NASA projects more successful. 
Reliability Training 


Read the “Reference Document for Inspection: ‘Big Bird’s’ House Concept” (found at the end of ch. 7). The class meeting exercise
explains what has to be done and the reference document explains the system requirements. The “‘Big Bird’s’ Requirements
Checklist” gives the classifications for the inspection. Complete the “‘Big Bird’s’ Formal Inspection Subsystems Requirements”
and send it to the instructor to grade. A score of 70 percent correct will qualify you for a certificate (e.g., item 1, 2-acceptable,
item 3-squak, a cubic is about 17 inches, major, wrong, correctness, system).
Chapter 9 

Software Quality Assurance 


Concept of Quality 

Let us first look at the concept of quality before going on to 
software quality. The need for quality is universal. The concepts 
of “zero defects” and “doing it right the first time” have changed 
our perspective on quality management from that of measuring 
defects per unit and acceptable quality levels to monitoring the 
design and cost-reduction processes. The present concepts 
indicate that quality is not free. One viewpoint is that a major 
improvement in quality can be achieved by perfecting the 
process of developing a product. Thus, we would characterize 
the process, implement processes to achieve customer satisfac- 
tion, correct defects as soon as possible, and then strive for total 
quality management. The key to achieving quality appears to
have a third major factor in addition to product and process: the
environment. People are important because they make the 
process or the product successful. Figure 9-1 represents the 
union of these three factors. 

The term “software quality” is defined and interpreted differ- 
ently by the many companies involved in producing program- 
ming products. To place the subject in perspective, we present 
principles and definitions for software quality from several 
source materials: 

(1) The purpose of software quality assurance is to assure the
acquisition of high-quality software products on schedule, within
cost, and in compliance with the performance requirements
(ref. 9-1).

(2) The developer of a methodology for assessing the quality
of a software product must respond to various needs. There can
be no single quality metric (ref. 9-2).

(3) The process of assessing the quality of a software product
begins when specific characteristics and certain of the metrics
are selected (ref. 9-3).

(4) Software quality can be defined as (a) the totality of 
features and characteristics of a software product that bear on its
ability to satisfy needs (e.g., conform to specifications),
(b) the degree to which software possesses a desired combination
of attributes, (c) the degree to which a customer or user 
perceives that software meets his or her expectations, and 
(d) the composite characteristics of software that determine the 
degree to which the software in use will meet the expectations 
of the user. 

We can infer from these statements and other source mate- 
rials that software quality metrics (e.g., defects per 1000 lines 
of code per programmer year, 70 percent successful test cases 
for the first 4 weeks, and zero major problems at the prelimi- 
nary design review) may vary more than hardware quality 
metrics (e.g., mean time between failures (MTBF) or errors per 
1000 transactions). In addition, software quality management 
has generally focused on the process whereas software reliabil- 
ity management has focused on the product. Since processes 
differ for different software products, few comparative benchmarks
are available. For hardware in general, benchmarks
have been available for a long time (i.e., MIL-HDBK-217E
series (ref. 9-4) for reliability). Recently, Rome Air Development
Center (RADC), the sponsor of MIL-HDBK-217E,
sponsored a software reliability survey that was intended to
give software quality the same status as that of hardware.

The next step is to discuss the process of achieving quality
in software and how quality management is involved. The
purpose of quality management for programming products is
to ensure that a preselected software quality level is achieved
on schedule and in a cost-effective manner. In developing a
quality management system, the programming product’s critical
life-cycle phase reviews provide the reference base for 
tracking the achievement of quality objectives. The guidelines 
for reliability and maintainability management of the 
International Electrotechnical Commission (IEC) system 
life-cycle phases follow: 
(1) Concept and definition: The need for the product is 
decided and its basic requirements defined, usually in the form 
of a product specification agreed upon by the manufacturer and 
user. 

(2) Design and development: The product hardware and 
software are created to perform the functions described in the 
product specification. This phase will normally include the 
assembly and testing of a prototype product under laboratory- 
simulated conditions or in actual field trials and the formulation 
of detailed manufacturing specifications and instructions for 
operation and maintenance. 

(3) Manufacturing, installation, and acceptance: The design
is put into production. In the case of large, complex products, 
the installation of the product on a particular site may be 
regarded as an extension of the manufacturing process. This 
phase will normally conclude with acceptance testing of the 
product before it is released to the user. 

(4) Operation and maintenance: The product is operated for 
the period of its useful life. During this phase, essential preven- 
tive and corrective maintenance is performed, product en- 
hancements are made, and product performance is monitored. 
The useful life of a product ends when its operation becomes 
uneconomical because of increasing repair costs, it becomes 
technically obsolete, or other factors make its use impractical. 

(5) Disposal: The product reaches the end of its planned 
useful life or the requirement no longer exists for the product, 
so it is disposed of, destroyed, or modernized, if economically 
feasible. 

The quality of the programming product can be controlled in 
the first three life-cycle phases to achieve the expected level of 
performance of the final product. When the fourth phase 
(operation and maintenance) has been entered, the quality of 
the software is generally fixed. With these five life-cycle phase 
boundaries in place, we can conceptualize what can be implemented
as “programming quality measurement.” If the phases
and activities are the X- and Y-coordinates, the individual
quality metrics can be placed on the Z-axis as shown in
figure 9-2.

Figure 9-2.—Programming quality measurement map. 


Without stating the specific activities for each phase, we can
discuss the generalities of software quality and its cost. The cost
of implementing quality increases with distance along the
X-axis. Activities can be arranged along the Y-axis so that the
cost of quality increases with distance along the Y-axis. With
this arrangement, we can establish rigorous quality standards
for the individual quality metrics as a function of cost effectiveness
(e.g., error seeding, the statistical implanting and removal
of software defects, may be expensive). Other quality metrics
(e.g., test case effectiveness) may cost significantly less and
could be selected.
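
The text mentions error seeding without giving the calculation. As an illustration only, the sketch below uses the classic seeded-ratio estimator, which assumes that real (indigenous) defects are found at the same rate as deliberately seeded ones. The choice of estimator and the function names are assumptions and are not taken from the source.

```python
def estimate_indigenous_defects(seeded_total, seeded_found, indigenous_found):
    """Estimate the total real defects from the fraction of seeded defects found.

    Assumes real and seeded defects are equally likely to be detected, so
    indigenous_found / total_indigenous ~ seeded_found / seeded_total.
    """
    if seeded_found <= 0:
        raise ValueError("at least one seeded defect must be found")
    if seeded_found > seeded_total:
        raise ValueError("cannot find more seeded defects than were seeded")
    return indigenous_found * seeded_total / seeded_found


def estimate_remaining_defects(seeded_total, seeded_found, indigenous_found):
    """Estimate the real defects still undetected after testing."""
    total = estimate_indigenous_defects(seeded_total, seeded_found, indigenous_found)
    return total - indigenous_found
```

For example, if 20 defects are seeded, 15 of them are found, and 30 real defects are found in the same testing, the estimate is 40 real defects in total with 10 remaining. The expense noted in the text comes from having to create, implant, track, and remove the seeded defects.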

In general, for a programming product, the higher the level of 
quality, the lower the costs of the product’s operation and maintenance
phase. This fact produces an incentive for implementing
quality metrics in the early design phases. The programming
industry has traditionally required large maintenance organizations
to correct programming product defects. Figure 9-3 
presents a typical phase-cost curve that shows the increased costs 
of correcting programming defects in the later phases of the 
programming product’s life cycle. Note that the vertical axis is 
nonlinear. 


Software Quality 

The next step is to look at specific software quality items. 
Software quality is defined in reference 9^t as “the achieve- 
ment of a preselected software quality level within the costs, 
schedule, and productivity boundaries established by manage- 
ment.” However, agreement on such a definition is often 
difficult to achieve. In practice, the quality emphasis can
change with respect to the specific product application environment.
Different perspectives of software product quality have
been presented over the years. However, in today’s literature,
there is general agreement that the proper quality level for a
particular software product should be determined in the
concept and definition phase and that quality managers should
monitor the project during the remaining life-cycle phases to
ensure the proper quality level.

Figure 9-3.—Increasing costs of programming defects. (Horizontal axis:
life-cycle phases, from concept and definition through design and
development and manufacturing and installation to operation and
maintenance.)

The developer of a methodology for assessing the quality of 
a software product must respond to the specific characteristics 
of the product. There can be no single quality metric. The 
process of assessing the quality of a software product begins 
with the selection of specific characteristics, quality metrics, 
and performance criteria. 

The specifics of software quality can now be addressed with 
respect to these areas: 

(1) Software quality characteristics

(2) Software quality metrics

(3) Overall software quality metrics

(4) Software quality standards

Areas (1) and (2) are applicable during the design and development
phase and the operation and maintenance phase. In
general, area (2) is used during the design and development
phase before the acceptance phase for a given software product.

Software Quality Characteristics 

A software quality characteristic tree is presented in refer- 
ence 9-5. The authors assume that different software products 
require different sets of quality characteristics. A product that 
has a rigorous constraint on size may sacrifice the maintain- 
ability characteristic of the software to meet its operational 
program size goals. However, this same product may need to be 
highly portable for use on several different processors. In 
general, the primary software quality characteristics are

(1) Maintainability

(2) Portability

(3) Reliability

(4) Testability

(5) Understandability

(6) Usability

(7) Freedom from error

Figure 9-4.—Management's view of quality.

TABLE 9-1.—APPLICATION-DEPENDENT
SOFTWARE QUALITY CHARACTERISTICS

Characteristic     Application                       Importance
Maintainability    Aircraft                          High
                   Management information systems    Medium
                   Testbeds                          Low
Portability        Spacecraft                        Low
                   Testbeds                          High

Management’s view of software quality is the quality charac- 
teristics. Established criteria for these characteristics will pro- 
vide the level of quality desired. The quantitative measures 
(metrics) place the quality at the achieved level. This concept 
is shown in figure 9-4. 

Software quality criteria and metrics are directly related to 
the specific product. Too often, establishing the characteristic 
and the metric in the early life-cycle phases without the proper 
criteria leads to defective software. An example of the charac- 
teristics and their importance for various applications is pre- 
sented in table 9-1. 


Software Quality Metrics 

The entire area of software measurements and metrics has 
been widely published and discussed. Two textbooks 
(refs. 9-6 and 9-7) and the establishment of the Institute of
Electrical and Electronics Engineers (IEEE) Computer 
Society’s working group on metrics, which has developed a 
guide for software reliability measurement, are three examples 
of such activity. Software metrics cannot be developed before
the cause and effect of a software defect have been established
for a given product with relation to its product life cycle.

TABLE 9-2.—MEASUREMENT OF SOFTWARE QUALITY CHARACTERISTICS

Software life-cycle phases (columns): 3, Product definition; 4, Top-level
design; 5, Detailed design; 7, Testing and integration; 9, Maintenance
and enhancements

Characteristics (rows): Maintainability; Portability; Reliability;
Testability; Test case completion; Estimate of bugs remaining;
Understandability; Usability; Freedom from error

(a) Where quality characteristic should be measured.
(b) Where impact of poor quality is realized.
(c) Metric can take form of process indicator.

TABLE 9-3.—MEASUREMENTS AND PROGRAMMING PRODUCT LIFE CYCLE

System life-cycle    Software life-cycle phase           Order of precedence
phase                                                    Primary                     Secondary

Concept and          Conceptual planning (1)
definition           Requirements definition (2)
                     Product definition (3)              Quality metrics (a)

Design and           Top-level design (4)                Quality metrics             Process indicators
development          Detailed design (5)                 Quality metrics             Process indicators
                     Implementation (6)                  Process indicators (b)      Quality metrics

Manufacturing and    Testing and integration (7)         Process indicators          Performance measures
installation         Qualification, installation,        Performance measures (c)    Quality metrics
                     and acceptance (8)

Operation and        Maintenance and                     Performance measures
maintenance          enhancements (9)

Disposal             Disposal (10)

(a) Metrics: qualitative assessment, quantitative prediction, or both.
(b) Indicators: month-by-month tracking of key project parameters.
(c) Measures: quantitative performance assessment.

Table 9-2 is a typical cause-and-effect chart for a software 
product and includes the process indicator concept. At the 
testing stage of product development, the evolution of software 
quality levels can be assessed by characteristics such as free- 
dom from error, completion of a successful test case, and estimate 
of the software bugs remaining. These process indicators can be 
used to predict slippage of the product delivery date, the 
inability to meet original design goals, or other development 
problems. 
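
As one illustration of a process indicator, weekly test case results can be compared with a target pass rate to flag possible schedule trouble. The sketch below is hypothetical: the function names are invented here, and the 70 percent default simply reuses the example threshold mentioned earlier in the chapter.

```python
def pass_rate(passed, executed):
    """Fraction of executed test cases that passed (0.0 if none were run)."""
    if executed == 0:
        return 0.0
    return passed / executed


def weeks_below_target(weekly_results, target=0.70):
    """Return the 1-based week numbers whose pass rate fell below `target`.

    `weekly_results` is a list of (passed, executed) pairs, one per week.
    """
    return [
        week
        for week, (passed, executed) in enumerate(weekly_results, start=1)
        if pass_rate(passed, executed) < target
    ]
```

A run of flagged weeks late in the testing and integration phase is the kind of signal that could be used to predict slippage of the delivery date.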


When the programming product enters the qualification, 
installation, and acceptance phase and continues into the mainte- 
nance and enhancements phase, the concept of performance is 
important in the quality characteristic activity. This concept is 
shown in table 9-3, where the aforementioned 5 IEC system 
life-cycle phases have been expanded into 10 software life- 
cycle phases: 

(1) Conceptual planning: The functional, operational, and 
economic context of the proposed software is understood and 
documented in a product proposal. 
(2) Requirements definition: A product proposal is expanded
into specific product requirements and the requirements, such
as performance and functional capabilities, are analyzed and
translated into unambiguous developer-oriented terms.

(3) Product definition phase: Software engineering prin- 
ciples, technical information, and creativity are used to describe 
the architecture, interfaces, algorithms, and data that will satisfy 
the specified requirements. 

(4) Top-level design: The functional, operational, and per- 
formance requirements are analyzed and designs for system 
architecture, software architecture, interfaces, and data are 
created and documented to satisfy requirements. 

(5) Detailed design: The functional, operational, and perfor- 
mance requirements are analyzed and designs for system 
architecture, software architecture, components, interfaces, 
and data are further created, documented, and verified to satisfy 
requirements. 

(6) Implementation: The software product is created or 
implemented from the software design and the faults are 
detected and removed. 

(7) Testing and integration: Software elements, hardware 
elements, or both are combined into an overall system or an 
element of a system, and the elements are tested in an orderly 
process until the entire system has been evaluated, integrated, 
and tested. 

(8) Qualification, installation, and acceptance: A software 
product is formally tested to assure the customer or the customer’s
representative that the product meets its specified require- 
ments. This phase includes all steps necessary to deliver, install, 
and test a specific release of the system software and its 
deliverable documentation. 

(9) Maintenance and enhancements: The product is ready 
for serving its designated function, is monitored for satisfactory 
performance, and is modified as necessary to correct problems 
or to respond to changing requirements. 

(10) Disposal: The product reaches the end of its planned
useful life, or the requirement for the product no longer exists,
and it is disposed of, destroyed, or modernized if economically
feasible.


Overall Software Quality Metrics 

Several overall software quality metrics have been put into 
practice and have effectively indicated software quality. Jones 
(ref. 9-8) presents an overall quality metric called defect 
removal efficiency. The data collected for the overall quality 
metric are simplified to the more practical expression of 
“defects per 1000 lines of source code.” 
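The two measures named above can be sketched in a few lines. This is a minimal illustration, assuming the usual definition of defect removal efficiency as the share of all known defects that were removed before release. The function names and the sample counts are hypothetical and are not taken from reference 9-8.

```python
def defects_per_kloc(defects: int, source_lines: int) -> float:
    """Defect density: defects per 1000 lines of source code."""
    if source_lines <= 0:
        raise ValueError("source_lines must be positive")
    return defects / (source_lines / 1000.0)


def defect_removal_efficiency(found_before_release: int, found_after_release: int) -> float:
    """Fraction of all known defects removed before release (assumed definition)."""
    total = found_before_release + found_after_release
    if total == 0:
        raise ValueError("no defects recorded")
    return found_before_release / total


# Hypothetical project: 45 defects in 30 000 source lines,
# 40 of them found before release and 5 reported from the field.
density = defects_per_kloc(45, 30_000)
efficiency = defect_removal_efficiency(40, 5)
print(f"{density:.1f} defects per 1000 lines")
print(f"{efficiency:.0%} defect removal efficiency")
```

The density figure is easy to collect and compare across products, which is presumably why it is the form used in practice.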

A second overall quality metric is based on the concept of 
quality prisms (refs. 9-9 and 9-10), which considers the extent 
of effort with which a given quality characteristic has been 
implanted into a product and the degree of effort for quality that 
has occurred in each life-cycle phase. An example of the extent
and degree of effort is presented in table 9-4 for any given
quality characteristic. From the table,

(1) Each quality characteristic can have a matrix similar to
this with a specific quality program tailored to a company’s
products.

(2) The quality effort is extended to each of the product’s
life-cycle phases to the degree desired by the company.

(3) For each level, as the complexity and difficulty of a
characteristic requirement increase, the intensity of the test and
verification program effort increases.

(4) This matrix will change for each characteristic in accor- 
dance with company emphasis. 

(5) Traditionally, the quality levels of a product correspond 
to degrees of effort. However, this matrix extends the effort to 
all phases of the product’s life cycle. 

As an example of using the matrix shown in table 9-4, a
characteristic such as reliability may be targeted to reach service 
level 2. Then throughout planning, design, testing, integration, 
and installation, the reliability should achieve at least level 2. 
These indicators are tied to the proper major phase review 
points of a product’s life cycle. For most characteristics, the 
planning level should be achieved after the preliminary design 
review (PDR); the design level, after the development phase or 
at the critical design review (CDR); the integration level, after 
integration at the qualification testing; and the service level, 
during the operational service reviews. 
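The review-point rule above can be sketched as a simple check. This is only an illustration: the phase-to-review mapping is taken from the paragraph, while the function name, the short phase keys, and the sample levels are hypothetical.

```python
# Review point at which each phase's level should be demonstrated (from the text).
REVIEW_POINTS = {
    "planning": "preliminary design review (PDR)",
    "design": "critical design review (CDR)",
    "integration": "qualification testing",
    "service": "operational service reviews",
}


def phases_below_target(achieved: dict[str, int], target: int) -> list[str]:
    """Return the phases, in life-cycle order, whose achieved level is below target."""
    return [phase for phase in REVIEW_POINTS if achieved.get(phase, 0) < target]


# Hypothetical status for a reliability characteristic targeted at level 2.
achieved_levels = {"planning": 2, "design": 3, "integration": 1, "service": 2}
for phase in phases_below_target(achieved_levels, target=2):
    print(f"{phase}: below level 2 at the {REVIEW_POINTS[phase]}")
```

A phase with no recorded level is treated as level 0 (no activity), so a missing entry is reported rather than silently passed.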

Now, quality management can apply this matrix to each 
characteristic in a manner depending on how critical it is to 
ensure achievement of the characteristic. For example, the 
reliability goal for a key system may be 10 or fewer mishandled 
calls per week, but the reliability goal for a private branch 
exchange (PBX) may be only 5 mishandled calls per month. 
These objectives may cause quality management to define a 
planning 2, design 2, integration 2, and service 2 program for 
the key system and a more demanding planning 4, design 3, 
integration 3, and service 3 program for the PBX. 
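Because the two goals are stated over different periods, it can help to put them on a common basis before comparing them. The sketch below does that, assuming an average of 52/12 weeks per month; the helper name is hypothetical.

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month (assumption)


def per_month(count: float, period: str) -> float:
    """Convert a mishandled-call allowance to calls per month."""
    if period == "week":
        return count * WEEKS_PER_MONTH
    if period == "month":
        return count
    raise ValueError(f"unsupported period: {period}")


key_system = per_month(10, "week")  # goal from the text: 10 per week
pbx = per_month(5, "month")         # goal from the text: 5 per month
print(f"key system: {key_system:.1f} mishandled calls per month")
print(f"PBX:        {pbx:.1f} mishandled calls per month")
print(f"the PBX goal is about {key_system / pbx:.0f} times tighter")
```

On this basis the PBX allowance is several times smaller than the key-system allowance, which is consistent with the more demanding program assigned to it.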

In this manner, the quality characteristics are clearly identi- 
fied by detailed criteria that set the scope of and limit the required 
objectives. Once these objectives are identified, a quality pro- 
gram can be determined to define the specific required defini- 
tion, design, test, and measurement efforts. No longer are 
nebulous measurements made against vague objectives in the 
service phase of a product’s life cycle in a last-minute attempt 
to improve quality. 

The program for pursuing quality characteristics must be 
established early. If a particular quality characteristic is not 
pursued to a reasonable extent in the planning and design 
phases, a maximum degree of effort (4) may not realistically be 
achieved in the service phase. Conversely, the more uniformly 
and consistently a quality characteristic is pursued, the more 
achievable and figuratively stable the characteristic. This is 
graphically represented for a single characteristic in
figures 9-5 to 9-7, where the quality item is shown as either
stable, unstable, or extremely costly to stabilize.

TABLE 9-4. QUALITY CHARACTERISTIC DEGREE/EXTENT MATRIX

[Product phases give the extent of effort; service levels 0 to 4 give the degree of effort.]

Product phase: Planning
  Service level 0: No activity
  Service level 1: General high level required
  Service level 2: Specific detailed requirements definition
  Service level 3: Highly complex required definition and support model
  Service level 4: Difficult or complex required definition and prototype

Product phase: Design and test
  Service level 0: No activity
  Service level 1: General architecture consideration; general test and measurement program
  Service level 2: Detailed architecture structure impact; language impact; test program extended
  Service level 3: Extensive architecture and structure consideration; tailored language, operating system, man-machine interface impact, etc.; code walkthroughs; detailed documentation
  Service level 4: Separate quality teams to verify design; detailed test facility; extensive qualification test plans and procedure

Product phase: Integration and installation
  Service level 0: No activity
  Service level 1: General quality management program; acceptance test; nominal change control quality program
  Service level 2: Extensive qualification test plans and procedure to verify characteristics; above-nominal-quality-requirement verification testing
  Service level 3: Quality teams formed; detailed quality configuration control release program; extensive data collection, verification, and analysis
  Service level 4: Specialized quality integration, manufacturing, and installation programs to ensure achievement of quality characteristics by separate quality organization

Product phase: Service
  Service level 0: No activity
  Service level 1: General quality tracking and redesign program to achieve quality objectives and requirements
  Service level 2: Formal data collection and analysis program to verify quality objectives; quality redesign effort
  Service level 3: Detailed measurements, data analysis, and modeling program to verify high-level quality objectives; extensive redesign to obtain quality
  Service level 4: Extensive measures and modeling, vigorous data analysis, and specialized tests to ensure high-level achievement of detailed quality requirements; extensive change program

Degree of effort
  Service level 0: No quality
  Service level 1: First level of quality
  Service level 2: Second level of quality
  Service level 3: Third level of quality
  Service level 4: Fourth level of quality

In figure 9-5 an optimum tradeoff of stability and productiv- 
ity is portrayed. The base of the prism is secure, supporting the 
platform by properly balancing quality versus cost. In 
figure 9-6 schedule pressures have established an unstable prism 
to support the platform. In this example, the decision was made 
to send the product into the field at service level 1 even though 
it initially had reached a more extensive degree of quality (3) in 
the planning phase (considerable effort to define quality objec- 
tives in the planning phase but no followup). Figure 9-7 
presents the extremely costly view of upgrading a program- 
ming product in the field to service level 4 (after passing the 
first three phases only to the first degree). Note the increasing 
amount of time and effort to achieve service levels 1, 2, or 3.
Service level 4 in this example is usually extremely difficult 
and expensive, if not impossible, to achieve. The measured 
productivity of such a product will most likely be low. 

An excellent example of the need for this type of quality 
management process occurred many years ago, but the lessons 
still apply today. An automated program was proposed to
generate, from 160 fields of input data per customer, a central-
ized data base that would control a table-driven, wired logic
system. It was estimated that 13 weeks of design time would be
required to construct this table generator by using a nominal 
amount of computer support time. A representative of the 
design group was assigned to define the input and output 
requirements for the support program and verify its operation. 
The program was initially written in assembly language. It was 
later redesigned and split into three separate programs written 
in a high-level language. These programs could then be sepa- 
rately designed, verified, and maintained. The main consider- 
ation became the verification process. An input and output test 
was written to check the extensive program paths. The project 
dragged along for a year as verification testing attempted to 
meet a zero-defect objective (imposed after the initial design 
had been completed). Costs increased and the schedule became 
critical as the customer became impatient (fig. 9-7). As the 
program began to function more successfully, deciding the 
degree of testing required for verification became a serious 
problem. Confrontation developed between the design and 
marketing departments over the commercial release of the 
program. The testing continued without agreement on the 
required degree of effort. Eventually, the customer became 
disillusioned and turned to another firm to provide the table 
generator. 
Figure 9-6. Instability due to scheduling decisions.


Had a clear quality management decision been made in the 
planning phase and tracked throughout the development on the 
degree of error-free “verified” operation, the quality character- 
istic objectives for its design architecture and structure, the 
language required for changes, and so forth, a more realistic 
projection (and control) of schedule and people could have 
been achieved. Several releases to the customer may have been 
required as the program designs and operation were verified to 
a predetermined extent within the various life-cycle phases. 
Had this procedure been followed, both the customer and the 
supplier would have been more satisfied. 

Figure 9-7. Extremely costly programming products.

This example offered an excellent opportunity to first deter-
mine the type and degree of quality desired. Then management
could have constructed a quality process, in terms of the extent
and degree of each desired characteristic, with an elastic
compromise between the schedule, resources, and design activ-
ity needed to achieve it. In this case, many of the “ilities”
(changeability, usability, maintainability, and reliability) were
subsequently more critically identified. These considerations
could have been translated into the initial requirements for 
structural design, program segmentation, extensive documen- 
tation, type of language, amount of code walkthrough, number 
of subfunctional tests, amount of error acceptable at first 
release, depth of verification reviews, and so on. From this 
form of planning, the quality prisms could have been estab- 
lished to define the extent and degree (such as service level 2, 
3, or 4) to which each of these characteristics should have been 
pursued in terms of project cost restraints that depended on user 
willingness to pay and wait for a quality product. 

A figuratively secure prismatic base for the programming 
product is presented in figure 9-5. This security is developed 
through execution of an extensive quality program, as progres- 
sively shown in figures 9-8 to 9-10. A product’s quality 
objective is usually composed of more than one characteristic. 
Previously, those have tentatively been noted as maintainabil- 
ity, portability, reliability, testability, understandability, 
usability, and freedom from error. Thus, quality management 
can extend the support prismatic structure to a greater depth 
than to just one quality characteristic. In practice, several 
quality prisms will be placed together to achieve a firm quality 
base. 

It may be desirable to have a product developed that has 
reached service level 4 for all the aforementioned quality 
characteristics. However, realistic schedules and productivity 
goals must be considered in terms of cost. These considerations 
establish the need for vigorous quality management over all 
life-cycle phases to selectively balance the various possibili- 
ties. It would be nonsupportive, expensive, and time consum- 
ing if quality management established the structural combination 
of individual characteristic quality prisms graphically 
presented in figure 9-11. Unfortunately, this is the case for too
many products. Quality management would do better to estab-
lish a more consistent support structure, like that represented in
figure 9-12. The figurative result of this consistent effort is
shown in the solid cost-effective base of figure 9-13.

Figure 9-8. Delicate balance: planning complete.

Figure 9-9. Delicate balance: design and testing complete.

Figure 9-10. Delicate balance: integration and installation complete.

Figure 9-11. Example of poor quality management. (Key: P, planning; D, design and test; I, integration and installation; S, service.)

Figure 9-12. Example of good quality management. (Key: P, planning; D, design and test; I, integration and installation; S, service.)

If quality characteristics are established, monitored, meas- 
ured, and verified throughout the life cycle, a realistic balance 
can successfully be achieved between quality costs, schedule, 
and productivity. However, it will require an active quality 
management process to establish and track these indicators. An 
example of such a quality management process matrix is pre- 
sented in table 9-5 to quantify the extent and degree of effort 
needed to achieve a desired level of quality. This table can be 
used as a programming product quality worksheet or as both 
the characteristic survey data collection instrument and part of 
the final quality prisms planning document. 


Figure 9-13. Example of solid quality base. (Prism labels: reliability, changeability, maintainability; levels shown: level 3, level 2, level 3.)


As discussed, a quality management team must establish the 
degree of quality that a particular quality characteristic must 
reach throughout its life cycle. It may use specialized support 
tools, measurement systems, and specific product quality stan- 
dards to pursue its quality objectives. A point system can give 
a quantitative reference for the pursuit of quality. The point 
system can become the basis for trading time versus cost 
to reach specific quality goals. Of course, a firm's quality
management will define its own point system. However, the
following example point system will serve as an illustration for
discussion purposes.

TABLE 9-5. EXAMPLE OF QUALITY MANAGEMENT PROCESS MATRIX

[Number in parentheses (circled in the original) denotes degree of quality selected by a quality management process.]

                                            Quality characteristic
Product phase                   Reliability    Changeability   Maintainability
Planning                        1 (2) 3 4      1 2 3 (4)       1 2 (3) 4
Design and test                 1 (2) 3 4      1 2 3 (4)       1 2 (3) 4
Integration and installation    1 (2) 3 4      1 (2) 3 4       1 2 3 (4)
Service                         1 (2) 3 4      1 (2) 3 4       1 2 (3) 4
                                            Degree of quality

TABLE 9-6. EXAMPLE OF PURSUIT OF QUALITY

                                            Quality characteristic
Product phase                   Reliability    Changeability   Maintainability
Planning                        2              4               3
Design and test                 2              4               3
Integration and installation    2              2               4
Service                         2              2               3
Total points/available points   8/16           12/16           13/16
                                (50 percent)   (75 percent)    (81 percent)
Total                           (33/48)/C3, or (69 percent)/C3

If a single characteristic's quality effort has progressed
through all four levels and through each level’s maximum
degree, it has accumulated a maximum of 4 + 4 + 4 + 4 = 16
points. If another characteristic’s effort has moved through the
levels only at one-half its maximum degree, it has accumulated
2 + 2 + 2 + 2 = 8 points. If it reached three-quarters of the
maximum degree of effort on all levels, it has 3 + 3 + 3 + 3 = 12
points. Management can now assign a reference value to the
pursuit of quality for a programming product. This is shown in
the simplified example in table 9-6. For this example the total
is 8 + 12 + 13 = 33 points out of a possible 16 + 16 + 16 = 48
points, or 69 percent. (In more general terms, this can also be
referred to as an overall level-3 quality effort in the 50- to
75-percent range.) Note that the real indication of the quality
objectives will be the magnitude of the X/Y (33/48) values. The
greater the X- and Y-values, the deeper the degree to which the
characteristics have been pursued. The greater the X-value, the
more stable the structure has become and the more quality
objectives the programming product has achieved.
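The point arithmetic of table 9-6 can be reproduced with a short script. This is a sketch for checking the totals only; the levels are copied from the table, and the variable and function names are hypothetical.

```python
MAX_LEVEL = 4
PHASES = ["planning", "design and test", "integration and installation", "service"]

# Levels from table 9-6, one entry per phase in PHASES order.
levels = {
    "reliability": [2, 2, 2, 2],
    "changeability": [4, 4, 2, 2],
    "maintainability": [3, 3, 4, 3],
}


def quality_ratio(levels_by_characteristic):
    """Return (points earned, points available, number of characteristics)."""
    points = sum(sum(v) for v in levels_by_characteristic.values())
    available = MAX_LEVEL * len(PHASES) * len(levels_by_characteristic)
    return points, available, len(levels_by_characteristic)


for name, phase_levels in levels.items():
    print(f"{name}: {sum(phase_levels)}/{MAX_LEVEL * len(PHASES)}")

x, y, n = quality_ratio(levels)
print(f"{x}/{y}/C{n}, or {x / y:.0%}/C{n}")
```

The last line prints the three-part form used in the text, X/Y/Cn, where n is the number of characteristics that were scored.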

If this type of analysis is carried over all eight characteristics
(8 × 16), a maximum of 128 points is possible. Products that
approach this level of effort will have a considerably more
stable structure than those that are only based upon a 16-point,
single-characteristic structure. The X-percent quality reference
number should also be qualified by a factor to note how many
characteristics were actually used. This could be shown as
69 percent/C3, or 33/48/C3.

Finally, some characteristics will be more complex and 
require greater costs to achieve than others. Thus, a weighting 
multiplier (WM) can be used to equalize the quality character- 
istics. Weighting multipliers for the preceding example 
are demonstrated in table 9-7. For this example, the total of 
10 + 28 + 19 = 57 points out of a possible 20 + 40 + 24 = 84 points
is 57/84/C3, or 68 percent/C3. This three-part programming
quality ratio (e.g., 57/84/C3) can be used for reviewing quality
across programming products within a corporation as a more
quantitative cross reference of quality costs to quality objectives.
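The weighted totals of table 9-7 can be checked the same way. In this sketch the level and multiplier pairs are copied from the table, and the available points for a phase are taken to be the maximum level (4) times that phase's multiplier, which reproduces the table's denominators. All names are hypothetical.

```python
MAX_LEVEL = 4

# (level, weighting multiplier) per phase, copied from table 9-7 in the order
# planning, design and test, integration and installation, service.
weighted = {
    "reliability": [(2, 1), (2, 1), (2, 1), (2, 2)],
    "changeability": [(4, 2), (4, 2), (2, 3), (2, 3)],
    "maintainability": [(3, 2), (3, 1.5), (4, 1), (3, 1.5)],
}


def weighted_points(entries):
    """Return (earned, available) points for one characteristic."""
    earned = sum(level * wm for level, wm in entries)
    available = sum(MAX_LEVEL * wm for _, wm in entries)
    return earned, available


totals = {name: weighted_points(entries) for name, entries in weighted.items()}
earned = sum(e for e, _ in totals.values())
available = sum(a for _, a in totals.values())

for name, (e, a) in totals.items():
    print(f"{name}: {e:g}/{a:g} ({e / a:.0%})")
print(f"total: {earned:g}/{available:g}/C{len(weighted)}, "
      f"or {earned / available:.0%}/C{len(weighted)}")
```

Note that the multipliers change both the numerator and the denominator, so a heavily weighted phase moves the overall percentage more than a lightly weighted one.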

A quality management process matrix (table 9-5) has been 
presented for pursuing quality throughout a programming 
product’s life cycle. It relates the pursuit of quality character- 
istics to the planning, design and testing, integration and 
installation, and service phases. In practice, actual implemen- 
tation of this approach will require the selection of languages, 
walkthroughs of code, type of testing, and so forth to be 
specifically defined for reaching service quality level 2, 3, or 4. 
From this matrix, the impact on schedule and the cost of quality 
can be projected and monitored. 

This process will also help management to compare the 
extent and degree of quality for products of competing compa-
nies or internal corporate divisions. Of course, until such a
standard is developed, the quality management team will sub-
jectively assign values and multipliers as noted in table 9-5 and
relate them to their own acceptable degree of documentation,
walkthrough of code, and module tests. These subjective values
are extremely useful in establishing individual product quality
effort goals, translating the concept of quality prisms to plan-
ning, design, and test considerations that balance schedule and
cost against quality objectives. However, management will
now have a more reasonable opportunity to pursue and success-
fully achieve the extent and degree of desired quality for their
products.

TABLE 9-7. EXAMPLE OF USE OF WEIGHTING MULTIPLIERS (WM)

[Each entry is level x WM.]

                                            Quality characteristic
Product phase                   Reliability    Changeability   Maintainability
Planning                        2 x 1          4 x 2           3 x 2
Design and test                 2 x 1          4 x 2           3 x 1.5
Integration and installation    2 x 1          2 x 3           4 x 1
Service                         2 x 2          2 x 3           3 x 1.5
Total points/available points   10/20          28/40           19/24
                                (50 percent)   (70 percent)    (79 percent)
Total                           (57/84)/C3, or (68 percent)/C3

The ability to specify an overall software quality metric has 
been addressed. Overall quality measurements can be normal- 
ized, as in the quality prisms concept, for purposes of compari- 
son. The quality prisms concept can be used to compare the 
software of two or more different projects within the same 
company or of different companies even if the software prod- 
ucts have unique applications or utilize different programming 
languages. Quality prisms can also be used to combine hard- 
ware quality and software quality into an assessment of the 
quality of the entire system. 


Software Quality Standards 

The relationship of software quality standards and software 
quality measurements is depicted in figure 9-14. Measure- 
ments and standards must agree. If a set of quality standards is 
established (e.g., zero defects) and quality measurement cannot
prove it (i.e., through exhaustive testing, error seeding, etc.), 
the software development project must realistically set a goal so 
that both quality standards and measurements can be devel- 
oped. The IEEE has published many articles on and general 
guides for formulating goal criteria. In addition, many technical 
papers are available on setting specific goals on the bases of life
cycle and a per-delivered software product (ref. 9-11).
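Error seeding, mentioned above as one way to back a quality standard with a measurement, can be sketched as follows. The text gives no formula; this sketch assumes the classic seeding estimator, in which the fraction of seeded defects recovered by testing is taken to equal the fraction of indigenous defects recovered. The function name and the counts are hypothetical.

```python
def estimate_remaining_defects(seeded: int, seeded_found: int, indigenous_found: int) -> float:
    """Estimate indigenous defects still in the code after a seeded test campaign."""
    if seeded <= 0 or not 0 < seeded_found <= seeded:
        raise ValueError("need 0 < seeded_found <= seeded")
    # Assumption: testing finds seeded and indigenous defects at the same rate.
    estimated_total = indigenous_found * seeded / seeded_found
    return estimated_total - indigenous_found


# Hypothetical campaign: 20 defects seeded, 16 of them found,
# along with 40 defects that were not seeded.
remaining = estimate_remaining_defects(seeded=20, seeded_found=16, indigenous_found=40)
print(f"estimated indigenous defects remaining: {remaining:.0f}")
```

An estimate of this kind gives a measurable goal (for example, a bound on defects remaining) in place of a standard such as zero defects that no practical measurement can confirm.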



Figure 9-14. Relationship of measurements and standards.


Concluding Remarks 

This chapter has presented a snapshot of software quality 
assurance today and has indicated future directions. A basis for 
software quality standardization was issued by the IEEE. 
Research is continuing into the use of overall software quality 
metrics and better software prediction tools for determining the 
defect population. In addition, simulators and code generators 
are being further developed so that high-quality software can be 
produced. 

Several key topics were discussed: 

(1) Life-cycle phases 

(2) Software quality characteristics 

(3) Software quality metrics 

(4) Overall software quality metrics 

(5) Software quality standards 

(6) Process indicators 

(7) Performance measures 

Process indicators are closely tied to the software quality 
effort and some include them as part of software development. 
In general, there are measures such as (1) test cases completed
versus test cases planned and (2) the number of lines of code 
developed versus the number expected. Such process indica- 
tors can also be rolled up (all software development projects 
added together) to give an indication of overall company or 
corporate progress toward a quality software product. Too 
often, personnel are moved from one project to another and thus 
the lagging projects improve but the leading projects decline in 
their process indicators. The life cycle for programming prod- 
ucts should not be disrupted. 
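The two process indicators named above, and their roll-up across projects, can be sketched as follows. The class, field names, and month-end figures are hypothetical; the roll-up simply adds all projects together before forming the same two ratios, as the paragraph describes.

```python
from dataclasses import dataclass


@dataclass
class ProjectStatus:
    name: str
    tests_completed: int
    tests_planned: int
    lines_developed: int
    lines_expected: int


def indicators(p: ProjectStatus) -> tuple[float, float]:
    """Return (test progress, code progress) as fractions of plan."""
    return p.tests_completed / p.tests_planned, p.lines_developed / p.lines_expected


def roll_up(projects: list[ProjectStatus]) -> tuple[float, float]:
    """Add all projects together, then compute the same two indicators."""
    total = ProjectStatus(
        "all projects",
        sum(p.tests_completed for p in projects),
        sum(p.tests_planned for p in projects),
        sum(p.lines_developed for p in projects),
        sum(p.lines_expected for p in projects),
    )
    return indicators(total)


# Hypothetical month-end data for a leading and a lagging project.
projects = [
    ProjectStatus("A", 80, 100, 9_000, 10_000),
    ProjectStatus("B", 30, 100, 4_000, 10_000),
]
for p in projects:
    tests, code = indicators(p)
    print(f"project {p.name}: tests {tests:.0%}, code {code:.0%}")
tests, code = roll_up(projects)
print(f"company roll-up: tests {tests:.0%}, code {code:.0%}")
```

The roll-up hides the spread between projects, so the per-project lines are worth keeping alongside it when staff are being moved between projects.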

Performance measures, which include such criteria as the 
percentage of proper transactions, the number of system restarts, 
the number of system reloads, and the percentage of uptime, 
should reflect the user’s viewpoint. The recently proposed
concept of performability combines performance and availabil-
ity from the customer’s perspective.
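The performance measures listed above are simple ratios and counts taken from the user's side of the system. The sketch below computes them for one reporting period; the record layout and all figures are hypothetical, and no formula for performability is attempted because the text does not give one.

```python
def percentage(part: float, whole: float) -> float:
    """Return part/whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Hypothetical 30-day month (720 hours) for one installed system.
month = {
    "transactions_total": 100_000,
    "transactions_proper": 99_850,
    "hours_in_period": 720.0,
    "hours_down": 3.6,
    "system_restarts": 2,
    "system_reloads": 1,
}

proper_pct = percentage(month["transactions_proper"], month["transactions_total"])
uptime_pct = percentage(month["hours_in_period"] - month["hours_down"], month["hours_in_period"])

print(f"proper transactions: {proper_pct:.2f} percent")
print(f"uptime:              {uptime_pct:.2f} percent")
print(f"system restarts:     {month['system_restarts']}")
print(f"system reloads:      {month['system_reloads']}")
```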

In general, the determination of applicable quality measures 
for a given software product development is viewed as a 
specific task of the software quality assurance function. The 
determination of the process indicators and performance mea- 
sures is a task of the software quality standards function. 
Reliability Training¹

1. What are the three factors that determine quality software?


A. Process, material, and vibration 

B. Process, product, and environment 

C. Planning, product, and shock 

D. All of the above 


2. What does software quality consist of? 

A. Various aspects of producing programming products 

B. Bar charts for process control 

C. Statistical analysis of software bugs 

D. All of the above 


3. How is the term “software quality” defined? 

A. To assure the acquisition of high-quality software products on schedule, within cost, and in compliance with the 
performance requirements 

B. To ignore various needs 

C. To develop specifications and attributes, perceive customer needs, and meet the user’s expectations

D. All of the above 


4a. What are the 10 software life-cycle phases?

A. Conceptual; requirements; product definition; design; implementation; testing; vibration; prototypes; installation; and
disposal

B. Planning; definition; design; manufacturing; testing; acceptance; debugging; and repair

C. Conceptual planning; requirements definition; product definition; top-level design; detailed design; implementation;
testing and integration; qualification, installation, and acceptance; maintenance and enhancements; and disposal

D. All of the above 


4b. What are the IEC system life-cycle phases? 

A. Concept and research; design and plan; manufacture and debug; operation and maintenance; and wearout

B. Concept and definition; design and development; manufacturing and installation; operation and maintenance; and disposal

C. Research and development; design and breadboard; manufacturing and testing; operation and maintenance; and disposal 

D. All of the above 


4c. How can the 10 software life-cycle phases be combined to fit in the IEC system life-cycle phases? 

A. Concept and definition: conceptual planning; requirements definition; and product definition 
B. Design and development: top-level design and detailed design

C. Manufacturing and installation: implementation; testing and integration; qualification; and installation and acceptance 

D. Operations and maintenance: maintenance and enhancement 

E. Disposal: disposal 

F. All of the above 


¹Answers are given at the end of this manual.

5. Can there be different degrees of a quality characteristic for different life-cycle phases?

A. Yes B. No C. Do not know 

6a. The definition of a lack of software quality is 

A. The lack of proper planning in early life-cycle phases 

B. The application of dependent software quality characteristics 

C. Poorly developed software that lacks proper criteria in life-cycle phases 

D. All of the above 

6b. Three example characteristics of software quality are 

A. Testing, integration, and portability 

B. Maintainability, portability, and reliability 

C. Design, implementation, and reliability 

D. All of the above 


7. Seven software quality characteristics are 

A. Maintainability, portability, reliability, testability, understandability, usability, and freedom from error

B. Planning, definition, reliability, testing, software, hardware, usability

C. Design, implementation, integration, qualification, acceptance, enhancement, maintenance

D. All of the above

8. Management has decided that quality engineering should measure four characteristics of the XYZ software: maintainability,
portability, reliability, and testability. The desired goals set at the beginning of the program by management were
maintainability, 3.5; portability, 3.0; reliability, 3.[illegible]; and testability, 3.5. The overall goal was thus
[illegible] percent/C4 for the extent of quality. The 2-year program gave the following results:

Characteristic     Planning   Design and test   Integration   Service
Maintainability    4.0        3.5               3.4           3.4
Portability        4.0        3.0               3.1           3.1
Reliability        3.5        3.6               3.9           3.9
Testability        4.0        3.1               3.5           3.6
Total              15.5       13.2              13.9          14.0

a. The actual extent of quality was 

A. (87.5 percent)/C4 B. (88.4 percent)/C4 C. (88.8 percent)/C4 D. None of these 

b. Have the management objectives been achieved? 

A. Yes B. No C. Do not know
Chapter 10

Reliability Management 


Roots of Reliability Management 

Over the past few years the term “reliability management” 
has been raised to a high level of awareness. Previously, the 
management of reliability was concerned with eliminating 
failure by testing to prove reliability, and it generally comple- 
mented the design function. Quality management, on the other 
hand, focused on quality control and generally aligned itself 
with manufacturing and production. The picture began to change 
with the focus on customer reliability and quality concerns. 
Specifically, the usage and standardization by companies of 
reliability growth models established that the new concept of 
reliability management is replacing the old concept of the 
management of reliability. The focus is now on enlarging the 
area of reliability concern to all phases of the life cycle. The
current thinking is that all aspects of management operations
and functions must be integrated in the reliability concept and 
program. Thus, reliability in the manufacturing or production 
phase is as important as reliability in the design phase 
(ref. 10-1), as shown in figure 10-1. 

Planning a Reliability Management 
Organization 

Planning a reliability management organization requires that
the reliability function report to a high enough level to be effec-
tive. The reporting level is too low if it does not involve top
management in reliability issues. For example, many success-
ful programs today encompass 3 to 6 hours per month at vice-
presidential staff meetings. Each company must find the level
that makes reliability a significant issue to be addressed. A
guide to reliability management is reference 10-2.

Figure 10-1. Life-cycle reliability growth with two different parts to first customer shipment.

NASA/TP—2000-207428

A functional organization forms groups to perform similar 
generic tasks such as planning, designing, testing, and reliability. 
Often, such an organization gets mired down with too many 
levels of management, and specific product priorities are often 
different in the many task groups. However, many benefits 
accrue from the concentration of talent and constant technical 
peer review. With today’s time-to-market pressures, building 
such a large centralized reliability organization is often not the 
best choice. The team approach, distributed reliability, is often 
selected over functional organization. 

In a team organization, people with diverse talents and 
backgrounds comprise the teams. Quality circles and reliability 
circles are based on the same organizational approach. Even 
though peer review is not ongoing, the cross-technology knowl-
edge of today’s personnel appears to fully compensate for the
lack of constant peer review. In the software development 
world, several types of team organization exist. For instance, 
the first type, the project team, is typical and is a hierarchical 
organization in which programmers with less experience are 
assigned to work for programmers with more experience. The 
project team is designed to fit the company organization rather 
than to fit project requirements. The second type is the chief 
programmer team, which employs a highly skilled person who 
performs most of the programming while providing technical 
direction. A third type is the Weinberg programming team, 
which is composed of groups of 10 or fewer programmers with
complementary skills. Group consensus and leadership role 
shifts are characteristic of this type. Each of these team organi- 
zations has advantages depending on the size of the project, the 
newness of the technology being implemented, and so on. 

The fourth type of team organization, the matrix, is a hybrid 
approach that combines functional talent to put teams together, 
but it can be a reliability disaster especially if time-to-market 
pressures exist. Often the technology is masked by middle 
management procedural meetings because these teams report 
to one manager. Individual contributors are added to work on 
one or more tasks of a given project or product development. 
These projects usually report to middle management. 

A fifth possible type of team organization is based on the 
theory stated in reference 10-3: reliability is actively pursued
by involvement starting on the vice-presidential level and pro- 
ceeds throughout the organization. This new style of reliability 
involves establishing a reliability council, dedicating a full-time 
diagnostic person or team, and generally making an upward 
change in the reliability reporting level. Figure 10-2 presents 
this concept. The reliability council’s responsibilities are to

(1) Endorse the annual reliability plan

(2) Regularly review reliability status

(3) Approve reliability improvement projects

(4) Set priorities on resources

(5) Assign tasks

(6) Regularly review tasks

(7) Participate in reliability improvement awards

Figure 10-2.—Reliability organization.

The reliability council membership may consist of the

(1) Vice president of the company or division as chairman

(2) Vice president’s staff

(3) Vice president’s business partners

(4) Corporate engineering director

(5) Corporate manufacturing director

(6) Corporate customer services director

The diagnostic team’s or person’s functions are to

(1) Review the internal reliability status

(2) Review reliability as perceived by customers

(3) Recommend tasks to the reliability council

(4) Diagnose problems

(5) Design experiments

(6) Collect and analyze data

The diagnostic team’s or person’s concerns include

(1) Reliability, quality, and statistics

(2) Engineering and manufacturing engineering

(3) Product development and process optimization

(4) Product assembly and test strategies

(5) Customer perception

This is a new dynamic approach for establishing reliability
management at the proper level in a corporation while optimizing
its effectiveness.


General Management Considerations


Program Establishment 

To design for successful reliability and continue to provide 
customers with a reliable product, the following steps are 
necessary: 
(1) Determine the reliability goals to be met.

(2) Construct a symbolic representation (e.g., block dia- 
gram or Petri net, ref. 10-4). 

(3) Determine the logistics support and repair philosophy. 

(4) Select the reliability analysis procedure. 

(5) Select the source or sources of the data for failure rates 
and repair rates. 

(6) Determine the failure rates and the repair rates. 

(7) Perform the necessary calculations. 

(8) Validate and verify the reliability. 

(9) Measure reliability until customer shipment. 

This section will address the first three steps in detail.


Goals and Objectives 

Goals must be placed into the proper perspective. They are 
often examined by using models that the producer develops. 
However, one of the weakest links in the reliability process is 
the modeling. Dr. John D. Spragins, an editor for the IEEE
Transactions on Computers, places this fact in context
(ref. 10-3) with the following statement: 

Some standard definitions of reliability or avail- 
ability, such as those based on the probability that 
all components of a system are operational at a 
given time, can be dismissed as irrelevant when 
studying large telecommunication networks. Many 
telecommunication networks are so large that the 
probability they are operational according to this 
criterion may be very nearly zero; at least one item 
of equipment may be down essentially all of the 
time. The typical user, however, does not see this 
unless he or she happens to be the unlucky person 
whose equipment fails; the system may still operate 
perfectly from this user’s point of view. A more 
meaningful criterion is one based on the reliability 
seen by typical system users. The reliability appar- 
ent to system operators is another valid, but distinct, 
criterion. (Since system operators commonly con- 
sider systems down only after failures have been 
reported to them, and may not hear of short 
self-clearing outages, their estimates of reliability 
are often higher than the values seen by users.) 

Reliability objectives can be defined differently for various 
systems. An example from the telecommunications industry 
(ref. 10-5) is presented in table 10-1. We can quantify the 
objectives, for example, for a private automatic branch exchange 
(PABX) (ref. 10-6) as shown in table 10-2, which presents the 
reliability specifications for a wide variation of PABX sizes 
(from fewer than 120 lines to over 5000 lines). 


TABLE 10-1.—RELIABILITY OBJECTIVES FOR TELECOMMUNICATIONS INDUSTRY

Module or system                          Objective

Telephone instrument                      Mean time between failures

Electronic key system                     Complete loss of service
                                          Major loss of service
                                          Minor loss of service

PABX                                      Complete loss of service
                                          Major loss of service
                                          Minor loss of service
                                          Mishandled calls

Traffic service position system (TSPS)    Mishandled calls
                                          System outage

Class 5 office                            System outage

Class 4 office                            Loss of service

Class 3 office                            Service degradation


Symbolic Representation 

Chapter 3 presents reliability diagrams, models that are the 
symbolic representations of the analysis. The relationship of 
operation and failures can be represented in these models. 
Redundancy (simple and compound) is also discussed. Perfor- 
mance estimates and reliability predictions are now being 
performed simultaneously by using symbolic modeling con- 
cepts such as Petri nets. 

In 1966, Carl Adam Petri published a mathematical tech- 
nique for modeling. Known as a Petri net, it is a tool for analyzing 
systems and their projected behavior. In 1987, he delivered the 
keynote address at the international workshop on Petri nets and 
performance models (ref. 10-7). Many applications were dis- 
cussed: the use of timed models for determining the expected 
delay in complex sequences of actions, the use of methods to 
determine the average data throughput of parallel computers, 
and the average failure rates of fault-tolerant computer designs. 
Correctness analysis and flexible manufacturing techniques 
were also described. Timed Petri nets show promise for analyz- 
ing throughput performance in computer and communications 
systems. 

A Petri net is an abstract and formal graphical model used for 
systems that exhibit concurrent, asynchronous, or nondeter- 
ministic behavior. The Petri net model provides accurate sys- 
tem information when it validly represents the system and the 
model solution is correct. A Petri net is composed of four parts: 
a set of places, a set of transitions, an input function, and an 
output function. The input and output functions relate to tran- 
sitions and places. In general, graphics are used to represent the 
Petri net structures and to show the concepts and the problems. 
A circle represents a place, a bar represents a transition, and 
directed arcs connect transitions to places or places to transi- 
tions. The state of a Petri net is called the PN marking and is 
defined by the number of “tokens” contained in each place. 
TABLE 10-2.—RELIABILITY SPECIFICATION FOR PABX

Number of lines: <120, 200, 400, 600, 800, 1200, 3000, 5000
Common control performance: 









Mean time between catastrophic 

10 









failures, yr 

System outage time per 20 yr, hr 






j 

1 

>5 


Mean time between outages, yr 

— 

— 



.... 


>5 

1 

>5 

Mean time between complete 

5 

10 

40 

40 

40 


losses of service, yr 







Service level: 









Mean time between major losses 

200 

400 

300 

200 

150 

365 

365 


of service, days 

Mean time between minor losses 
of service, days 

60 

60 

50 

40 

30 

— 

30 

15 

Degradation of service, hr/yr 



.... 







Mishandled calls, percent 

0.1 

0.1 

0.1 

0.1 

0.1 

0.1 

0.1 

1 

0.02 


TABLE 10-3

Subsystem                    Onsite    Subdepot   Turnaround time*   Depot     Turnaround time*
                             spares?   spares?    of subdepot        spares?   of depot
                                                  spares, days                 spares, days

Common control and memory    Yes       Yes        2                  Yes       15
Network                      No                                                30
Line and trunk units         Yes                                               30
Peripheral equipment         No                                                30
Test equipment               No        No         —                  —         5


A place is an input to a transition if an arc exists from the place 
to the transition and an output if an arc exists from the transition 
to the place. A transition is enabled when each of its input
places contains at least one token. Enabled transitions can be
“fired” by removing one token from each input place and adding
one token to each output place. The firing of a transition causes
a change of state and produces a different PN marking.
Reference 10-8 contains additional information. Petri nets are a
useful reliability modeling tool.
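
The firing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from reference 10-8: it assumes unit arc weights and distinct input places, and the `Transition` class, the helper names, and the two-place "up/down" model of a repairable unit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """A transition with its input and output places (unit arc weights assumed)."""
    name: str
    inputs: tuple
    outputs: tuple


def is_enabled(marking, t):
    """A transition is enabled when every input place holds at least one token."""
    return all(marking.get(p, 0) >= 1 for p in t.inputs)


def fire(marking, t):
    """Return the new PN marking after firing t; the old marking is left unchanged."""
    if not is_enabled(marking, t):
        raise ValueError(f"transition {t.name!r} is not enabled")
    new_marking = dict(marking)
    for p in t.inputs:       # remove one token from each input place
        new_marking[p] -= 1
    for p in t.outputs:      # add one token to each output place
        new_marking[p] = new_marking.get(p, 0) + 1
    return new_marking


# Hypothetical model: one unit that is either up or down.
fail = Transition("fail", inputs=("up",), outputs=("down",))
repair = Transition("repair", inputs=("down",), outputs=("up",))

m0 = {"up": 1, "down": 0}   # initial PN marking
m1 = fire(m0, fail)         # the unit fails
m2 = fire(m1, repair)       # the unit is repaired
```

Each call to `fire` produces a new marking, so a sequence of markings traces the state history of the modeled system. A timed or stochastic Petri net would additionally attach delays or rates to the transitions, which this sketch omits.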


Logistics Support and Repair Philosophy 

The logistics support plan is normally based on criteria such 
as (1) failure rates and repair rates of replaceable units, (2) 
system maturity, (3) whether the sites can be served by depots 
or subdepots, and (4) the rate at which additional sites are added 
to the depot responsibility. Since spares are the key to support, 
this chapter will examine them further. 

The size of the spares stock depends on (1) the criticality of
the replaceable unit to the system, (2) the necessary spare
adequacy level, (3) the number of systems served, (4) whether
the area served is rural, suburban, or urban, and (5) whether the
repair facility is onsite or remote. A typical spares policy for a
telecommunications system (ref. 10-9) is presented in table 10-3.

Policies can be formulated for families of systems or for 
multifamily geographical areas. The turnaround time depends 
on the replaceable unit’s failure rate, the repair location, the
repair costs, and so forth. A specific spares policy can be tailored 
to a given geographical area. Note that subsystems have differ- 
ent spares policies owing to the criticality of their failures in 
contrast to a blanket spares assignment without regard to 
functionality or survivability. 

Even though the spares location and turnaround time are the 
same for two different subsystems, the spares adequacy can be 
different. Some spares adequacy levels for a telecommunications
system are presented in table 10-4.

Spares provisioning is an important part of a spares plan. 
Requirements must be clearly stated or they can lead to over- or 
undersparing. For example, a spares adequacy of 99.5 percent 
can be interpreted in two ways. First, six spares might be needed
to guarantee that spares are available 99.5 percent of the time. 
Alternatively, if one states that when a failure occurs, a spare 
TABLE 10-4.—SPARES ADEQUACY

Subsystem                    Onsite     Subdepot     Depot
                             spares?    spares       spares
                                             Adequacy^a

Common control and memory    Yes        0.9995       0.9995
Network                      No         .995         .995
Line and trunk units         Yes        .999         .999
Peripheral equipment         No         .99          .99
Test equipment               No                      .95

^aProbability of having spares available.


TABLE 10-6.—MAINTENANCE ACTION RECOMMENDATIONS

Action                                      Before       Busy    After        Off-shift
                                            busy hour    hour    busy hour    time

Repair                                      Yes          Yes     Yes          Yes
Defer repair for (days)                     0            0       1            1
Is second failure affecting service?        No           Yes     No           No
Probability of no similar second failure    0.95         0.90    0.82         0.60
Site failures last month                    Low          High    Normal       Low
Site failures last year                     Low          Low     Normal       Low
Transient error rate                        Low          High    Low          Low


TABLE 10-5. — DEPOT EFFECTIVENESS FOR TYPICAL DIGITAL PABX 


Foreign 

branch 

part 

Control 

automatic 

trunk 

Printed wiring cards for n systems 

Spare printed wiring cards for n systems 

1 

2 

10 

50 

100 

1 

2 

10 

50 

100 

in 

6 

65 

130 

m 

3 250 

6 500 

2 

2 

5 

13 

20 

H 9 

5 

16 

32 

■9 

800 

1 600 

1 

1 

2 

5 

7 

15004 

6 

14 

28 

140 

700 

1 400 

1 

1 

4 

5 

8 

■ 

8 

28 

56 

H 

1400 

2 800 

2 

1 

4 

10 

15 


16 

153 

206 

m 

7 650 

15 300 

7 

11 

29 

106 

196 


Total, for n = 1, 2, 10, 50, and 100 systems:

  Printed wiring cards          1058    2116    10 580    52 900    105 800
  Spare printed wiring cards     153     173       287       658       1001
  Spares, percent of total      14.5     8.2       2.7       1.2       0.95


must be available 99.5 percent of the time, it will be necessary 
to supply 6+1=7 spares. 
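
The two readings of a 99.5-percent spares adequacy can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: the text does not give its demand model, so Poisson-distributed demand is assumed, and the mean of 2 failures per replenishment interval is a hypothetical value chosen so that the first reading yields six spares. The function names are likewise hypothetical.

```python
import math


def poisson_cdf(k, mean):
    """P(demand <= k) for Poisson-distributed demand with the given mean."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def spares_for_adequacy(mean_demand, adequacy):
    """Smallest stock s such that P(demand <= s) >= adequacy."""
    s = 0
    while poisson_cdf(s, mean_demand) < adequacy:
        s += 1
    return s


# Hypothetical demand: an average of 2 failures per replenishment interval.
# First reading: demand is covered by the stock 99.5 percent of the time.
stock = spares_for_adequacy(2.0, 0.995)

# Second reading: a spare must still be on hand when a failure occurs,
# which adds one unit to the stock.
stock_with_spare_on_hand = stock + 1
```

Under these assumptions `stock` is 6 and `stock_with_spare_on_hand` is 7, which shows how an ambiguous requirement changes the provisioning by one unit per stocked item.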

The establishment of depot and subdepot sparing, rather than 
only individual site sparing, has proven to be cost effective. As 
an example, table 10-5 presents the depot effectiveness for a 
typical digital PABX. This table indicates that a 14.5-percent 
spares level would be required if only per-site sparing were used;
however, when one depot serves 100 sites, the required spares 
level is less than 1 percent. 
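
The percentages quoted from table 10-5 follow directly from the table’s totals. As a small check, the sketch below recomputes the spares level from the total card counts and total spare counts in the table; the variable and function names are illustrative only.

```python
# Totals from table 10-5: systems served -> (printed wiring cards, spare cards).
totals = {
    1: (1058, 153),
    2: (2116, 173),
    10: (10580, 287),
    50: (52900, 658),
    100: (105800, 1001),
}


def spares_percent(cards, spares):
    """Spares expressed as a percentage of the installed printed wiring cards."""
    return 100.0 * spares / cards


percent_by_systems = {
    n: round(spares_percent(cards, spares), 2)
    for n, (cards, spares) in totals.items()
}
```

The result runs from about 14.5 percent for a single system down to about 0.95 percent for 100 systems, because the spare count grows far more slowly than the number of cards supported.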

A centralized maintenance base (CMB) (ref. 10-10) is essential
to a deferred maintenance concept. Deferred maintenance
can be available on a real-time basis. When a failure occurs at
an unattended site, the CMB would receive, on a display,
information on the criticality of the failure and on any deferred
maintenance action imposed, along with a projection
indicating impending problems. The CMB would analyze the
situation for the specific site configuration, the processing level
in the system, and the site’s failure-repair history.

Input data could consist of items such as the last similar 
occurrence, the next planned visit to the site, the criticality of
the site to the operating network, the cumulative site failures
for the last 3 months, and the probability of additional failures 
occurring. The data would be analyzed with a maintenance- 
prediction computer program to generate a table based on 
system loading, such as table 10-6. Often the suggested 
maintenance deferral time is recommended to be the next 
maintenance visit (NMV). The NMV will vary with the amount 
of equipment onsite and the projected failure frequency 
(ref. 10-10). 

The combination of deferred maintenance and a centralized 
maintenance base dictates the needs for an efficient spares 
program. Spares planning combined with knowledge of the 
logistics can optimize support costs. A depot stocking plan can 
additionally vary because of many factors, including error
coverage, system maturity, deferred repair, and maintenance 
familiarity. A dynamic (continuously updated) depot stocking 
plan would be cost effective. A dynamic depot model using 
Monte Carlo methods (ref. 10-11) includes unit delivery schedules,
item usage per month, support personnel efficiency, and
depot and base repair cycle times. 
Figure 10-3.—Overall reliability process.


Reliability Management Activities 

Performance Requirements 

It is often difficult to translate customer performance require- 
ments into design requirements, especially in the area of quality 
and reliability. Reliability encompasses both quantitative and 
qualitative measures. New terms in the computer industry, such 
as “robustness,” are not formally metricized. However, we can 
adapt concepts for the overall performance process (ref. 10-12) 
to apply to reliability as presented in figure 10-3. 

If a business’ matrix of reliability requirements is reduced to 
one or more models, subjective and qualitative customer- 
oriented reliability measures can be translated into quantitative 
system-oriented reliability criteria. Figure 10-3 identifies both
the top-down and bottom-up approaches to reliability valida- 
tion, which include (1) translation, (2) allocation, (3) require- 
ments, and (4) planning. 

With the identification of the agreed-to system-oriented 
reliability criteria, designer-oriented subsystem or module 
reliability parameters can be allocated as shown in fig- 
ure 10-3, generally by a system reliability team. The team 
evaluates simple versus redundant configurations, levels of 
fault detection and correction implementations, software con- 
siderations, and so forth. System or module reliability model- 
ing may specify reliability requirements for specific components. 
An example of such modeling is a failure modes and effects 
analysis (FMEA) performed on a product to predict the prob- 
ability of network failures due to a single failure or due to a 
failure after an accumulation of undetected failures. 


For example, a replacement product was to use a very large- 
scale integration (VLSI) implementation, and the protection 
against network failures needed to be assessed. An investiga- 
tion found no apparent standard industry FMEA method for 
VLSI components. Because future VLSI products may show an 
increasing need for FMEA, it is important that an industry 
standard be generated. In the network examples discussed, a 
single fault could directly cause a customer-oriented problem. 

The bottom-up approach to reliability validation ensures 
customer satisfaction. The appropriate certification, process 
metrics, and statistical in-process tests must be designed from 
the customer viewpoint. A step-by-step upward certification
and design review using process metrics can be designed to
ensure customer-oriented reliability. In addition, we can see the
need for the independent upward path from reliability planning 
and standards to customer-oriented reliability in figure 10-3. 
This is the key to success, since reliability control cannot be 
bypassed or eliminated from design- or performance-related 
issues. 


Specification Targets 

A system can have a detailed performance or reliability spec- 
ification that is based on customer requirements. The surviv- 
ability of a telecommunications network is defined as the 
ability of the network to perform under stress caused by cable 
cuts or sudden and lengthy traffic overloads and after failures 
including equipment breakdowns. Thus, performance and availability
have been combined into a unified metric. One area of
telecommunications where these principles have been applied
is the design and implementation of fiber-based networks.
Reference 10-13 states that “the statistical observation that on
the average 56 percent of the pairs in a copper cable are cut
when the cable is dug up, makes the copper network ‘structurally
survivable.’ ” On the other hand, a fiber network can be
assumed to be an all-or-nothing situation with 100 percent of
the circuits being affected by a cable cut, failure, or other
destruction. In this case study, according to reference 10-13,
“cross connects and allocatable capacity are utilized by the
intelligent network operation system to dynamically reconfigure
the network in the case of failures.” Figure 10-4 (from
ref. 10-14) presents a concept for specification targets.

Figure 10-4.—Specification target (ref. 10-14).


Field Studies

The customer may observe specific results of availability.
For instance, figure 10-5 has been the basis for the proposal of
an IEC technology trend document (ref. 10-15).

Figure 10-5.—Software availability.

System reliability testing is performed today to benchmark
the reliability, availability, and dependability metrics of complex
new hardware and software programs. Figure 10-6 (taken
from ref. 10-1) presents the traditional viewpoint of the design,
development, and production community on cumulative reliability
growth. It is possible that the same data generated both
curves in figure 10-6. When we measure the cumulative
reliability growth, the decline of production coupled with a
decline of reliability is masked. If we track the product on a
quarterly basis, the product often shows a relaxation of process
control, the incorporation of old, marginal components into the last
year’s product manufacture, a failure to incorporate the latest
changes into service manuals, the transfer of knowledgeable
personnel to other products, and so forth. Thus, there is a need to
track specific products on a quarterly basis (ref. 10-1).


Human Reliability

Analysis Methods

The major objectives of reliability management are to ensure 
that a selected reliability level for a product can be achieved on 
schedule in a cost-effective manner and that the customer 
perceives the selected reliability level. The current emphasis in 
reliability management is on meeting or exceeding customer 
expectations. We can view this as a challenge, but it should be 
viewed as the bridge between the user and the producer or 
provider. This bridge can be titled “human reliability.” In the
past, the producer was concerned with the process and the 
product and found reliability measurements that addressed 
both. Often there was no correlation between field data, the 
customer’s perception of reliability, and the producer’s reli- 
ability metrics. Surveys then began to indicate that the cus- 
tomer or user distinguished between reliability performance, 
response to order placement, technical support, service quality, 
and so on. 
Figure 10-6.—Traditional viewpoint of reliability growth (ref. 10-1).


Human Errors 

Human reliability is defined (ref. 10-16) as “the probability 
of accomplishing a job or task successfully by humans at any 
required stage in system operations within a specified minimum 
time limit (if the time requirement is specified).” Although 
customers generally are not yet requiring human reliability 
models in addition to the requested hardware and software 
reliability models, the science of human reliability is well 
established. 


Example 

Presently, the focus in design is shifting from hardware and 
software reliability to human reliability. A recent 2 1/2-year 
study by Bell Communications Research (ref. 10-17) indicated
that reliability in planning, design, and field maintenance pro- 
cedures must be focused on procedural errors, inadequate 
emergency actions, recovery and diagnostic programs, the 
design of preventive measures to reduce the likelihood of pro- 
cedural errors, and the improvement of the human factors in the 
design and subsequent documentation. Figure 10-7 shows the
study’s results for outages or crashes. Approximately 40 percent
of outage events and downtime is due to procedural problems
(human error). In fact, if software recovery problems are
included with procedural problems, 62 percent of the events and
68 percent of the downtime are due to human error. Therefore,
human reliability planning, modeling, design, and implementation
must be emphasized to achieve customer satisfaction.


Figure 10-7.—Reliability characteristics. [Bar chart of outage frequency (events or crashes), percent, and downtime (3.5 min) per year per machine, percent, by cause: operational software, recovery software, hardware, and procedural. Values as extracted: 24, 26, 29, 30, 38, 42.]

Presentation of Reliability

Reliability testing usually occurs during product develop- 
ment and ends with the first product shipment. However, 
product reliability testing can be cost effectively run through 
the manufacturing life of the product to achieve both continued 
customer satisfaction and the inherent reliability of the product. 

A major concern in planning reliability testing is the maturity 
of the specific manufacturing facility. For instance, a new plant 
may initially need three to five failures per week of tested 
product under controlled test environments to shape the manu- 
facturing process and the product specifics. Therefore, detailed 
failure analysis will be conducted on 150 to 250 failed items per
year. Once plant personnel begin to feel comfortable as a team 
and several of the plant’s processes, products, or both are 
certified, the goal of one failure per week can be instituted in a 
medium-mature plant. The team in a mature plant with few 
failures can observe leading indicators that forewarn of pos- 
sible problems and can prevent them from entering into the 
shipped product. Thus, in a mature plant the goal of one failure 
per 2 weeks can suffice as a benchmark for quality operations 
to achieve product reliability. 


Engineering and Manufacturing 

Measuring reliability in a practical way is a challenge. 
Reliability grows with product, process, and customer use 
maturity. We could measure, for example, the reliability at the 
first customer shipment and the reliability during a 5-year
production life. An effective start may be to establish a three-
to five-level reliability tier concept (ref. 10-18). For example,
table 10-7 presents a five-tier reliability concept. With this
concept, products can achieve the first customer shipment at a
mean time between failures (MTBF) of T(min). Manufacturing
and service will accept risks until T(spec) is reached.
Manufacturing has a commitment to drive the MTBF of the product up
to T(spec), and engineering has a commitment to provide
resources for solving design problems until T(spec) is reached.
The qualification team working with this process is now
involved throughout the design qualification process through
field feedback. Ideally, the MTBF’s of tiers 2 to 5 would be
equal; however, the calibration of reliability modeling tools
and the accuracy of field MTBF measurements are challenges
yet to be met in some corporations and industries. Thus, a three-
to five-tier approach is a practical and effective solution for
developing reliability measurements.


TABLE 10-7.—FIVE-TIER RELIABILITY CONCEPT

Tier   Mean time          Description
       between failures

1      T(min)             Minimum demonstrated MTBF before shipping
                          (statistical test)
2      T(spec)            Specified MTBF that meets market needs and
                          supports service pricing
3      T(design)          Design goal MTBF (calculation)
4      T(intrinsic)       Intrinsic MTBF (plant measurement)
5      T(field)           Field MTBF measurement

While the MTBF is between T(min) and T(spec), progress
is tracked toward T(spec) as a goal. The point is to find and fix
the problems and thus improve the reliability of the product.
Teamwork and commonality of purpose with manufacturing
and engineering are necessary to deal with real problems and
not symptoms. After T(spec) has been achieved, an “insurance
policy” is necessary to determine if anything has gone radically
wrong. This can be a gross evaluation based on limited data as
the “premiums” for a perfect “insurance policy” are too high.
Once T(spec) has been demonstrated, a trigger can be set at the
50-percent lower MTBF limit for control purposes. Improvement
plans at this level should be based on the return on
investment. At maturity, T(intrinsic), dependence on reliability
testing can be reduced. A few suggestions for reductions are
testing fewer samples, shortening tests, and skipping testing for
1 or 2 months when the personnel feel comfortable with the
product or process. With a reduced dependence on reliability
testing, other manufacturing process data can be used for full
control.
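
As a rough illustration of how progress against the first two tiers might be tracked, the sketch below compares a simple MTBF point estimate with T(min) and T(spec). It is an assumption-laden example: the point estimate stands in for the statistical test and the 50-percent lower limit mentioned in the text, and the function names, status wording, and numerical values are all hypothetical.

```python
def observed_mtbf(total_hours, failures):
    """Point estimate of MTBF: accumulated operating hours divided by failures."""
    if failures == 0:
        return float("inf")  # no failures observed yet
    return total_hours / failures


def tier_status(mtbf, t_min, t_spec):
    """Place an MTBF estimate relative to T(min) and T(spec)."""
    if mtbf < t_min:
        return "below T(min): not ready for first customer shipment"
    if mtbf < t_spec:
        return "between T(min) and T(spec): track progress toward T(spec)"
    return "T(spec) achieved: monitor for radical changes"


# Hypothetical targets and test data, in hours.
T_MIN = 2000.0
T_SPEC = 5000.0

estimate = observed_mtbf(total_hours=12000.0, failures=4)
status = tier_status(estimate, T_MIN, T_SPEC)
```

With these hypothetical numbers the estimate is 3000 hours, so the product sits between T(min) and T(spec) and the find-and-fix effort continues.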


User or Customer 

Reliability growth has been studied, modeled, and analyzed — 
usually from the design and development viewpoint. Seldom is 
the process or product studied from the customer’s or user’s
perspective. Furthermore, the reliability that the first customer 
observes with the first customer shipment can be quite different 
from the reliability that a customer will observe with a unit or 
system produced 5 years later, or the last customer shipment. 
Because the customer’s experience can vary with the maturity 
of a system, reliability growth is an important concept to 
customers and should be considered in the customer’s purchas- 
ing decision. 

The key to reliability growth is the ability to define the goals 
for the product or service from the customer’s perspective 
while reflecting the actual situation in which the customer 
obtains the product or service. For large telecommunications 
switching systems, there has been a rule of thumb for determin- 
ing reliability growth. Often systems have been allowed to 
operate at a lower availability than the specified availability 
goal for the first 6 months to 1 year of operation (ref. 10-19). 
In addition, component part replacement rates have often been 
allowed to be 50 percent higher than specified for the first 
6 months of operation. These allowances accommodated 
craftspersons’ learning patterns, software patches, design
errors, and so on. 
TABLE 10-8.—1980 GENERIC QUALITY METRICS
[From ref. 10-20.]


Metric 

Implementation phase 


Requirements 

Design 

Laboratory 

Field test 

Field 




system test 

1 


performance 

Open questions 

0 


rr; 


rz 



Problems fixed, per 



1/500 

1/1000 

1/1000 

words 







Problems open, per 

— 

1/5000 

1/5000 

1/2000 

1/2000 

words 







Interrupts, per day

— 


<20 

<20 

<25 

Audits, per day 

— - 

0 

<10 

<10 

<25 

Service affective 

— 



0 


) 

1.8 

incidents, per 
office month 








Reinitializations, per 

— 






1 

month 








Cutoff calls, per 

— 






<0.2 

10000 

Denied calls, per 







<0.7 

10000 

Trunk out of service, 







20 

min/yr 




- 




TABLE 10-9.—PRODUCTION LIFE-CYCLE RELIABILITY GROWTH CHART

System size                       Year:    1987                1988      ...     1994
                                  Quarter: Q1   Q2   Q3   Q4   Q1   Q2   ...   Q3   Q4

Small system:
  Reliability growth, percent               5    0    0    0    0    0   ...    0    0
  Time to steady state, months              3    0    0    0    0    0   ...    0    0

Medium system:
  Reliability growth, percent             100   50   25   10   10   10   ...   10   10
  Time to steady state, months              6    3    2    1    1    1   ...    1    1

Large system:
  Reliability growth, percent             200  100   50   50   33   33   ...   20   20
  Time to steady state, months             12    9    6    3    3    3   ...    3    3

The key to reliability growth is to have the growth measurement
encompass the entire life cycle of the product. The
concept is not new, only here the emphasis is placed on the 
customer’s perspective. Reference 10-20 presents the goals of 
software reliability growth (table 10-8). 

Table 10-8 covers a large complex system with built-in fault 
tolerance. According to reference 10-21, it is not
“technically or economically feasible to detect and fix all 
software problems in a system as large as No. 4 ESS [electronic 
switching system]. Consequently, a strong emphasis has been 
placed on making it sufficiently tolerant of software errors to 
provide successful operation and fault recovery in an environ- 
ment containing software problems.” 

Reliability growth can be specified from “day 1” of a product
development and can be measured or controlled on a product
with a 10-year life until “day 5000.” We can apply the philosophy
of reliability knowledge generation, which is to
generate reliability knowledge at the earliest possible time in
the planning process and to add to this base for the duration of
the product’s useful life. To accurately measure and control
reliability growth, we must examine the entire manufacturing
life cycle. One method is the construction of a production
life-cycle reliability growth chart.

Table 10-9 presents a chart for setting goals for small (e.g., 
a 60-line PABX or a personal computer), medium, and large 
systems. Small systems must achieve manufacturing, shipping, 
and installation maturity in 3 months to gain and keep a market 
share for present and future products. This is an achievable but 
difficult goal to reach. The difference in reliability growth 
characterization between small systems and larger systems is 
that the software-hardware-firmware interaction, coupled with 
the human factors of production, installation, and usage, limits 
the reliability growth over the production life cycle for most 
large, complex systems. 
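
As a rough illustration of how goals like those in table 10-9 might be used, the sketch below stores the time-to-steady-state goals (months) for the first four production quarters and checks a measured value against them. The dictionary layout, the function name, and the sample measurements are assumptions made for this example and do not come from the table's source.

```python
# Time-to-steady-state goals (months) for the first four production
# quarters, read from table 10-9. The data layout is an assumption.
STEADY_STATE_GOAL_MONTHS = {
    "small": [3, 0, 0, 0],
    "medium": [6, 3, 2, 1],
    "large": [12, 9, 6, 3],
}


def meets_goal(system_size: str, quarter_index: int, measured_months: float) -> bool:
    """Return True if the measured time to steady state is within the goal."""
    goal = STEADY_STATE_GOAL_MONTHS[system_size][quarter_index]
    return measured_months <= goal


# Hypothetical measurements: a medium system in its second quarter and a
# large system in its fourth quarter.
print(meets_goal("medium", 1, 2.5))  # goal is 3 months
print(meets_goal("large", 3, 4.0))   # goal is 3 months
```

Keeping one entry per product size mirrors the point made in the text that growth should be tracked by product line rather than as a single corporate figure.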

In certain large telecommunications systems, the long installation
time allows the electronic part reliability to grow so that
the customer observes both the design growth and the production
growth. Large, complex systems often offer a unique environment
to each product installation, which dictates that significant
reliability growth will occur. Yet, given the difference that
size and complexity impose on the resultant product reliability
growth, corporations with a wide scope of product lines should
not present overall reliability growth curves on a corporate
basis but must present individual product line reliability growth
pictures to achieve total customer satisfaction.
Reliability Training 1 

1. Reliability management is concerned with what phases of the life cycle? 

A. Design and development B. Manufacturing C. Customer D. All of the above 

2. Name a new style of organizing reliability activities. 

A. Functional B. Team C. Matrix D. Council 

3. What are the functions of the diagnostic team or person? 

A. Review the internal reliability status 

B. Review reliability as perceived by the customer 

C. Recommend tasks to the reliability council 

D. Diagnose problems 

E. Design experiments 

F. Collect and analyze data 

G. All of the above 

4. Name a goal category for a telephone instrument. 

A. Loss of service 

B. Mean time between failures 

C. Mishandled calls 

D. All of the above 

5. A PABX with 80.0 lines has a service level reliability specification for the mean time between major losses of service (MTBF) of 
A. 150 days B. 1 hour C. 0.1 percent D. All of the above 

6. A Petri net is composed of which of the following parts? 

A. A set of places 

B. A set of transitions 

C. An input function 

D. An output function 

E. All of the above 

7. For a telecommunications system, what is the spares adequacy level for a network subsystem with spares depots? 

A. 0.999 B. 0.995 C. 0.95 

8. Turnaround time depends on 

A. Replaceable unit failure rate 

B. Repair location 

C. Repair cost 

D. All of the above 


1 Answers are given at the end of this manual. 

9. Spares adequacy is the probability of having spares available. 

A. True B. False C. Do not know 

10. What is the normal maintenance action recommendation for the site to defer repair for (days) during off-shift time? 


A. 0 B. 2 C. 1 

11. The bottom-up approach to reliability makes use of planning, requirements, allocations, and customer orientation. 

A. True B. False C. Do not know 

12. Specification targets can be used to define what performance and availability requirements? 

A. Fully operational 

B. Subliminal availability 

C. Degraded operation 

D. Unusable 

E. Subliminal performance 

F. All of the above 


13. Tracking a product on a quarterly basis often shows 


A. A relaxation of process control 

B. Incorporation of old marginal components 

C. Failure to incorporate the latest changes into service manuals 

D. Knowledgeable personnel transferred to other products 

E. All of the above 

14. If we consider recovery, software, and procedural problems as human error, human error can account for what percentage of 
outage and downtime problems? 

a. Outage frequency, percent of events/crashes B. 55 C. 62 

b. Downtime (3.5 min), percent per year per machine B. 51 C. 68 


15. As a benchmark for quality operations to achieve product reliability, what is a reasonable goal (failures per week) for a mature 
plant? 


A. 3.0 B. 1.0 C. 0.5 

16. While the MTBF is between T(min) and T(spec), progress is tracked toward what goal? 

A. T(design) B. T(spec) C. T(intrinsic) 

17. The key to reliability growth is to have the growth measurement encompass 


A. The design phase 

B. The manufacturing phase 

C. The testing phase 

D. The user phase 

E. The entire life cycle of the product 
18. For a No. 4 ESS system in the field-test phase, the number of interrupts per day can be 
A. <20 B. >20 C. 40 

19. An electronic system must achieve manufacturing, shipping, and installation maturity in what period of time (months) to 
gain and keep market share? 

a. Small system A. 1 B. 2 C. 3 

b. Medium system A. 4 B. 6 C. 12 

c. Large system A. 12 B. 8 C. 16 
Chapter 11 

Designing for Maintainability and System 
Availability 


Introduction 

The final goal for a delivered system (an aircraft, a car, an
avionics box, or a computer) should be its availability to operate
and to perform its intended function over the expected design
life. Hence, in designing a system, we cannot think in terms of
delivering the system and just walking away. The system supplier
needs to provide support throughout the operating life of
the product, which involves the concepts presented in
figure 11-1. Here, supportability requires an effective combination
of reliability, maintainability, logistics, operations, and
safety engineering to have a system that is available for its
intended use throughout the designated mission lifetime (see
the Definitions section for more details). Maintainability is the
key to providing effective support, upkeep, modification, and
upgrading throughout the lifetime of the system.

This chapter will concentrate on maintainability and its
integration into the system engineering and design process. The
topics to be covered include the elements of maintainability, the
total cost of ownership, and the ways that system availability,
maintenance, and logistics costs plus spare parts costs affect
the overall program costs. System analysis and maintainability
will show how maintainability fits into the overall systems
approach to project development. Maintainability processes
and documents will focus on how maintainability is to be
performed and what documents are typically generated for a
large-scale program. Maintainability analysis shows how
tradeoffs can be performed for various alternative components.
Note that the majority of the mathematical analysis and examples
will concentrate on maintainability analysis at the
component level or below. In a highly complex and redundant
system, the evaluation of availability at a system level may be
extremely difficult. Redundancy, switches and software that can
be used to bypass failed subsystems, and other methodologies
can allow a system to operate even with some system degradation.
The treatment of these types of problems is beyond the scope
of this manual. Finally, specific problems for hands-on training
follow the concluding section.

Definitions 

Reliability is the probability that an item can perform its 
intended functions for a specific interval under stated condi- 
tions. What is the chance that a failure will stop the system from 
operating? Usually the failure is random and unexpected, not 
predicted as with brake wearout or a clutch or fatigue failure 
when a given input load spectrum is known. 

Availability is a measure of the degree to which an item is in 
the operable and committable state at the start of the mission, 
when the mission is called for at an unknown (random) point in 
time. Also, it is the probability of system readiness over a long 
interval of time. Will the system be ready to operate when 
needed? Does it have very high reliability or very small main- 
tenance requirements (easily maintainable and having a good 
supply of spare parts) or a combination of both? For example, 
what was the percentage of times a car started out of the total 
number of tries over its lifetime? Alternatively, how many days 
was it in the driveway ready to start as opposed to being in the 
garage for repairs? 
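
The two car examples above amount to simple ratios. The following sketch computes both; the function names and the sample counts are hypothetical and serve only to show the arithmetic.

```python
def availability_from_starts(successful_starts: int, attempts: int) -> float:
    """Fraction of start attempts that succeeded over the lifetime."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successful_starts / attempts


def availability_from_days(days_ready: float, days_in_repair: float) -> float:
    """Fraction of days the car was ready rather than in the garage."""
    total = days_ready + days_in_repair
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return days_ready / total


# Hypothetical figures: 985 successful starts in 1000 tries, and
# 355 days in the driveway against 10 days in the garage.
print(availability_from_starts(985, 1000))
print(round(availability_from_days(355, 10), 4))
```

The first ratio is closer to the "start of the mission" reading of availability, and the second to the long-interval readiness reading.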

Figure 11-1. System supportability requirements.

Maintainability is a system effectiveness concept that measures
the ease and rapidity with which a system or equipment is
restored to operational status after failing. Also, it is the
probability that a failed system can be restored to operating
condition in a specified interval of downtime. How easy is it to
diagnose the problems in a failed (or marginally operable)
system and how easy is it to replace the failed components (or
software) after this diagnosis has been made? If a system is not
reliable and is prone to partial or complete failures, if it is
difficult to find out what is causing the system to malfunction, or
if it is difficult to get to and replace failed components, we have
a serious problem that must be corrected (ref. 1).
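
The probability form of the maintainability definition can be made concrete with a small sketch. It assumes that repair times are exponentially distributed with a known mean time to repair, so that M(t) = 1 - exp(-t/MTTR). That distribution is an assumption chosen for simplicity and is not stated in the text; other repair-time distributions are often used in practice.

```python
import math


def maintainability(downtime_hours: float, mttr_hours: float) -> float:
    """Probability that a repair finishes within the allowed downtime.

    Assumes exponentially distributed repair times (illustrative only).
    """
    if mttr_hours <= 0:
        raise ValueError("MTTR must be positive")
    if downtime_hours < 0:
        raise ValueError("downtime must be non-negative")
    return 1.0 - math.exp(-downtime_hours / mttr_hours)


# Hypothetical unit with a 2-hour mean time to repair.
for allowed in (1.0, 2.0, 4.0):
    print(allowed, round(maintainability(allowed, 2.0), 3))
```

Under this assumption, allowing a downtime equal to the mean repair time restores the system only about 63 percent of the time, which is why specified downtime intervals are usually several times the mean.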

Safety analysis is that which considers the possible types, 
reasons, and effects of operation and failures on the system as 
they affect the personal safety of those who operate or maintain it. 

Logistics is the art and science of the management, engineer- 
ing, and technical activities concerned with requirements, 
design, and planning and maintaining resources to support 
objectives, plans, and operations. 

Operations defines the environment, schedule, loading, and 
input and output parameters a system will need to function and 
the tasks it will perform. 

Importance of Maintainability 

The importance of maintainability is further noted in 
figure 11-2. Too often, the performance specifications or the 
appearance of a product are the overriding factors in its acqui- 
sition or purchase. This attitude can be extremely detrimental, 
especially when the first failure occurs and it is realized that the 
availability of critical parts and the ease of maintenance keep 
critical systems operating. A large integrated system can come 
from the best possible design, utilizing the newest technology; 
it can be a work of art and outperform any competitive system, 
but who would want it if 

• System breakdowns could not be diagnosed to a level of 
detail needed to pinpoint the problem in a short time. 

• Spare parts were not readily available. 

• Repair required extremely long lead times. 

• Installing the spare parts was extremely difficult. 

• Checkout and/or alignment of spare parts was difficult. 

For all practical purposes, such a system is not available 
(operational). 

Elements of Maintainability 

We need to consider up front in our design what must be done 
to maintain the system. Either the system will not fail for the 
entire mission or some parts of the system will fail and will need 
to be replaced. If we do not have a system with perfect reliability 
(there is wearout), the following questions (as illustrated by 
fig. 11-3) should be asked: 

(1) What parts have high failure rates and how will their 
failure be diagnosed? For example, if a cathode ray tube (CRT) 
screen does not show a display, has the screen failed or has a 
power supply failed or has a computer stopped sending the 
screen data? 

(2) Can various problems be diagnosed easily? How quickly
can the problem be diagnosed? If there is an intermittent fault,
can information during this anomaly be retrieved later? If a
failure cannot be isolated or if insufficient diagnostic
capabilities are built into the system, restoration can be a
time-consuming task.

Figure 11-2. Importance of maintainability.

(3) How quickly can the system be repaired? Has the system 
been segmented into easily replaceable units? Are parts buried 
on top of one another with hundreds of attachment points 
between units? Also, can software be used to detect and route 
around a hardware failure and make the failure transparent to 
the user? 

(4) Where will spare parts be stored? How many spare units 
should be ordered? Will parts for a unit in Washington be lost 
in a warehouse in Los Angeles? Will there be an oversupply of 
one unit and a shortage of another? 

(5) Will a failed unit be discarded or repaired? If it is to be 
repaired, where should it be repaired? What equipment and 
personnel are required to do the work? 

(6) Will unique parts be available to repair the unit? Will
some unique part such as a traveling wave tube or a low-noise
amplifier still be manufactured when a replacement is needed
to repair a unit? Will the supplier who sold the unit repair it? If
repairs are agreed to, will the supplier still be in business
(logistics issues)?

When a product is planned, all these questions must be 
answered. Although some of these questions overlap with 
logistics (the science of supply and support of a system through- 
out its product life cycle), they must all be addressed. Early in 
the design phase of the product, the maintenance concept to be 
used for the system and the design for maintainability must be 
examined first. The following definitions will be helpful in 
making decisions in the design phase. 
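
Question (4) asks how many spare units should be ordered. One common way to frame it, assumed here rather than taken from the text, is to treat the number of failures during a resupply interval as Poisson distributed and to choose the smallest stock that meets a target probability of having a spare on hand. The function names, failure rate, interval, and target below are all hypothetical.

```python
import math


def spares_adequacy(spares: int, failure_rate_per_hour: float,
                    interval_hours: float) -> float:
    """P(failures in the interval <= spares), assuming Poisson demand."""
    mean = failure_rate_per_hour * interval_hours
    return sum(math.exp(-mean) * mean ** k / math.factorial(k)
               for k in range(spares + 1))


def spares_needed(target: float, failure_rate_per_hour: float,
                  interval_hours: float) -> int:
    """Smallest spare count whose adequacy meets the target probability."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    spares = 0
    while spares_adequacy(spares, failure_rate_per_hour, interval_hours) < target:
        spares += 1
    return spares


# Hypothetical unit: 1e-4 failures per hour, yearly resupply (8760 h).
print(spares_needed(0.95, 1e-4, 8760.0))
```

A higher target probability or a longer resupply interval increases the required stock, which is one way reliability and logistics decisions interact.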
Figure 11-3. Elements of maintainability.

Total Cost of Ownership

The total life-cycle cost of a unit must be assessed when
evaluating project cost. The need to support the system through
an effective logistics program that includes maintainability is
of paramount importance (fig. 11-4).

The project can follow a faster development course and
procure less reliable hardware; however, the maintenance cost
will make the project more expensive. Additionally, if the unit
is not available because of lengthy maintenance processes or
lack of spare parts, additional units must be procured to have the
fleet strength at the desired level (whether it is delivery vehicles
or research aircraft). The total cost of ownership includes

• Total life-cycle: more than just the cost of flight units and a
prototype unit

• Availability of the unit: more than the advertised features
when it is running (backup systems needed for excessive
downtime)

• Maintenance and logistics: often 40 to 60 percent of the total
system costs

• Spares: a function of reliability and speed with which the
system can be maintained

Figure 11-4. Hidden system costs. (Labels recovered from the
figure: risk management, ground support equipment, technical
data, maintenance, test equipment, training, software, disposal.)

Often all the costs associated with a project are not considered.
Besides just the cost of producing the units, a huge amount
of time and money must be expended keeping them operational
throughout the mission lifetime. Total project costs are considered
in table 11-1. Evident from the table is that total system
costs include design and development costs and a whole host of
training, operations, and maintenance costs.

As the quality and reliability of the system increase, the cost
of the system classically increases. However, this increase may
not necessarily occur because as the quality and reliability of
the system are improved, the cost of maintenance, logistics, and
spares decreases. Since total support costs are a function of
maintenance costs and the cost of the total number of spares,
spare repair, and spare transport, improved reliability also
drastically reduces the total cost of ownership.


TABLE 11-1. TOTAL PROJECT COSTS

Cost item                  Cost breakdown

Acquisition
Design and development     Research, trades, design, analysis, prototype
                           production and test
Production
Operations                 Personnel, facilities, utilities, operating
                           supplies and other consumables, maintenance
                           ground operations
Ground operations          Ground support engineering model and test and
                           checkout models; maintenance for these
Ground support equipment   All test, checkout, and diagnostic equipment;
                           purchase, storage, and calibration of ground
                           support equipment
Technical data             All manuals, specifications, configuration
                           management; software configuration management,
                           data base, storage
Training                   Continuous training of all operations and
                           maintenance personnel
Maintenance                Calibration, repair, and system downtime
Repair facilities          Labs, depots, and others
Test equipment             Equipment used for maintenance, alignment, and
                           calibration of the system; equipment used for
                           recertification (e.g., flight)
Software                   Maintenance, upgrades, test, and installation
Logistics                  Packaging, storage, transportation, and
                           handling; tracking support
Spares                     Spare orbital replacement units and line
                           replacement units; long-lead-time items and
                           critical components
Disposal                   Disassembling and recycling; disposing of
                           hazardous waste
Maintainability and Systems Engineering 

Figure 11-5 gives a global overview of a long-term research
project, such as the space program, and shows maintainability
as an integral part of it. The Horizon Mission Methodology
(HMM) was developed initially for the study of breakthrough
space technology. The HMM’s are hypothetical space missions
whose performance requirements cannot be met, even by
extrapolating known space technologies. The missions serve to
develop conceptual thinking and depart from simple projections
and variations of existing capabilities.

The use of HMM’s with breakthrough technology options
(BTO’s) has been an attempt to provide a systematic analytical
approach to evaluate and identify technological requirements
for BTO’s and to assess their potential for providing revolutionary
capabilities for advanced space missions.

Therefore, we can think of the space program (or other major 
research program) not just as a number of isolated projects but 
as a single unified program with a global goal (e.g., landing men 
on the Moon or planning a manned mission to Mars or estab- 
lishing a permanent manned lunar base). 



Figure 11-5. Systems engineering and operations. 


The program concept assumes a single consistent objective. 
It involves putting tested and proven equipment together to 
perform a step toward the goal. Another area of work involves 
developing technology and components and conducting 
ongoing exploration with the outer fringes of what lies ahead. 
At an individual project level, a number of different disciplines 
are brought together to design, develop, deploy, and operate the 
project. One of these disciplines is maintainability. Expanding 
the various maintainability activities over project phases gives 
us the chart of figure 11-6. Systems engineering at the National 
Aeronautics and Space Administration (NASA) uses five phases 
to describe a mission. Note that the maintainability program is 
run across all five phases. The task descriptions are also shown. 

The various activities are defined in the following sections.
Of great importance is that the maintainability concept of the
project be introduced early in the program. Without this
introduction, long-term missions will see costs rise and downtime
increase. True, initial development costs may increase, but
total cost will decrease. In some cases, projects have ignored
maintainability and built-in diagnostics to obtain budgetary
approval of a new system. However, the final costs always
increase as a result of this practice (ref. 2).

Finally, figure 11-7 shows the interrelationship of the various
project tasks and how work and information flow between
operations, reliability, and logistics functions. Basically, sys- 
tems operation and mission requirements are evaluated to 
generate the maintainability concept. This concept is further 
affected by component reliability and the various reliability 
analyses performed. This maintenance analysis is then inte- 
grated with design engineering to develop a design that can be 
repaired and maintained. 

Maintainability data and requirements flow to logistics to
allow development of an effective support resource program.
The output of the maintenance analysis is also critical to the
logistics support analysis (see footnote 1). The logistics support
analysis record (LSAR) and support resource development feed the
plan for (1) facilities to house equipment or ground operations, (2)
ground support equipment, (3) the logistics plan and other
activities, (4) data (technical publication) for equipment operation
and maintenance, and (5) identification of personnel and
training needed to maintain, repair, and support the equipment.
Finally, a maintainability demonstration is performed to evaluate
the actual times needed to diagnose and physically change out
a line replaceable unit (LRU) or an orbital replaceable unit
(ORU).

1 The following general guideline distinguishes support, logistics, and
maintenance for this manual. Supportability encompasses all logistics,
maintainability, and sustaining engineering. Logistics is involved with
all movement of orbital replaceable units (ORU’s) and spare parts, the
procuring and staging of spare parts, and the development of storage
containers. Maintainability is responsible (after the ORU’s are located)
for repairing ORU’s, shop replaceable units (SRU’s), and printed circuit
boards (PCB’s), which includes providing test and diagnostic equipment,
tools, training, a suitable work area, and maintenance personnel.
Figure 11-6. Maintainability in system life cycle. (Chart titled
"Systems Engineering: Maintainability/Integrated Logistics Support."
Phases: A, Preanalysis; B, Definition; C, Design; D, Development/
Testing; E, Production, Operation, Maintenance. Tasks shown across
the phases: Maintainability Program Management; Maintainability
Analysis: Tradeoffs; Maintainability Concepts: Requirement and
Availability; Sparing Concept; Level of Maintenance; Personnel and
Training; Schedule: Preventative Maintenance; Replacement Policy;
Support Equipment; Supplier Maintainability Analysis;
Maintainability Design Criteria; Maintainability Demonstration.)



Maintainability Processes and Documents 

The mission requirements analysis and the operational 
requirements of a new system are derived from the initial needs 
and wants of the community. Directly and simultaneously 
derived from this is the system maintenance concept (as 
described in the maintenance concept document (MCD)). 

At this time, an initial draft of maintenance requirements 
should also be developed. Operational requirements and sys- 
tem requirements are funneled into the maintenance concept 
document, which covers every aspect of a maintenance program
throughout the life of the system (see fig. 11-8) (ref. 3). 

First Phase 

The first phase involves planning and designing because
maintainability is made a part of the design process, which
includes making components easy to service. In this first step,
ORU’s (orbital replaceable units) or LRU’s (line replaceable
units) are selected. As the name implies, replaceable units can
be quickly changed out to bring the system back into operation.
To speed the system back into operation, it is typically divided
into units that can easily be replaced on-orbit or on the flight
line. A module or system is designated an ORU or an LRU if
that part of the design has high modularity (can be self-contained,
such as a power supply) and low connectivity (a
minimum of power and data cables to other parts of the system).
As we will discuss later, we must be able to diagnose that an
ORU or LRU has failed. This means that maintenance on-orbit
(or on the flight line) will only replace these items. The system
is built, tested, shipped, and put into operation. Operations and
maintenance training are also conducted.

The maintainability analysis (see fig. 11-9) also uses (1) the
predicted time for corrective maintenance times the number of
failures, (2) the predicted time for preventive maintenance (PM)
times the number of scheduled PM’s, and (3) the predicted time
for changeout of limited-life items times the number of scheduled
changeouts. With these times, a prediction of overall maintenance
time per period is made. Assuming that the system is shut
down during maintenance, we can then predict availability.
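
The prediction described above can be sketched in a few lines. The sketch sums the corrective, preventive, and limited-life changeout contributions and treats all maintenance time as downtime, as the text assumes. The function names and every number in the example are hypothetical.

```python
def maintenance_hours(failures: float, corrective_hours: float,
                      pm_count: float, pm_hours: float,
                      changeouts: float, changeout_hours: float) -> float:
    """Predicted maintenance time per period (hours)."""
    return (failures * corrective_hours
            + pm_count * pm_hours
            + changeouts * changeout_hours)


def predicted_availability(period_hours: float, maint_hours: float) -> float:
    """Availability when the system is shut down for all maintenance."""
    if period_hours <= 0 or maint_hours > period_hours:
        raise ValueError("maintenance time must fit within a positive period")
    return (period_hours - maint_hours) / period_hours


# Hypothetical year (8760 h): 4 failures at 3 h each, 12 PM actions at
# 2 h each, and 2 limited-life changeouts at 5 h each.
hours = maintenance_hours(4, 3.0, 12, 2.0, 2, 5.0)
print(hours, round(predicted_availability(8760.0, hours), 5))
```

As the design matures and better failure and repair-time estimates arrive, the same calculation can simply be rerun with updated inputs.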
Figure 11-7. Maintainability in systems engineering process.

Figure 11-8. Maintainability activities. (Label recovered from the
figure: 2. Performance of maintenance action.)

Figure 11-9. Maintainability analysis process.


As the design matures and the failure mode and effects analysis/
critical items list (FMEA/CIL) and supplier maintainability
program data mature, the overall availability (as well as other
maintainability figures of merit) is recalculated. The data
generated by the maintainability analysis serve to apprise
project management of the overall maturity of the design and
the ability of the design to meet program objectives.


Second Phase 

The second phase of maintenance is handling failures, per- 
forming preventive maintenance, and replacing life-limited 
items. Eventually the deployed unit breaks down. The failure 
must be detected and isolated to the actual failed ORU/LRU. 
How is the failure detected, and how is the maintenance action 
planned and executed? Can it be combined with any other 
maintenance actions or preventive maintenance activities? The 
on-orbit or flight line maintenance is performed by removing 
and replacing the failed unit. But what do we do with the broken 
ORU/LRU? 


Third Phase 

The third phase involves the handling of failed components. 
Here, repair-level analysis evaluates the failed ORU or LRU to 
determine whether it should be repaired or replaced. If repaired, 
it may be done in-house (intermediate maintenance at a main- 
tenance depot where more specialized equipment and better 
diagnostic instrumentation might be available) or at the factory. 
(The following section discusses the Maintenance Concept 
Document in more detail.) Then the unit needs to be recertified, 
retested, finally checked out, and returned to the spare parts 
storage area (preferably bonded storage). 
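
A repair-level decision of the kind described above can be sketched as a simple cost comparison. The text does not give a decision rule, so the rule used here (repair at the cheapest location only if that is cheaper than buying a replacement) and all names and cost figures are assumptions for illustration. A real analysis would also weigh turnaround time, spares levels, and recertification risk.

```python
from dataclasses import dataclass


@dataclass
class RepairOption:
    """One candidate repair location with its hypothetical costs."""
    location: str                # e.g., "depot" or "factory"
    repair_cost: float           # labor and parts
    shipping_cost: float
    recertification_cost: float  # retest and checkout

    def total(self) -> float:
        return self.repair_cost + self.shipping_cost + self.recertification_cost


def repair_level_decision(options: list[RepairOption],
                          replacement_cost: float) -> str:
    """Return 'repair at <location>' or 'replace' based on total cost."""
    if not options:
        return "replace"
    best = min(options, key=lambda option: option.total())
    if best.total() < replacement_cost:
        return f"repair at {best.location}"
    return "replace"


options = [
    RepairOption("depot", 1200.0, 100.0, 300.0),
    RepairOption("factory", 900.0, 500.0, 300.0),
]
print(repair_level_decision(options, replacement_cost=5000.0))
print(repair_level_decision(options, replacement_cost=1500.0))
```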

Only by developing the complete maintenance concept and
the maintenance requirements early in the development process
will the design really be impacted by maintenance needs.
The operational requirements document, the mission (or science)
requirements document, and the maintainability concept
document with preliminary requirements should be the design
drivers. Only then can effective trade studies, systems analysis
and functional analysis, and allocation be performed. Also,
trade studies with reliability and maintainability alternatives
can be used to evaluate total system cost. Reliability and
maintainability alternative selections will drive maintenance and
repair costs, shipping costs, ORU/LRU spare costs, long-lead-time
components, and components manufactured by complex
processes.

Figure 11-10. Maintainability documentation.

Documents 

Several documents (fig. 11-10) typically support a large- 
scale engineering project (some describe the activities already 
discussed). They officially begin with a basic plan and the 
maintenance concept document (MCD). The MCD together 
with the operations concept document and the science require- 
ments are the chief design and cost drivers for the future system. 
The individual documents are as follows: 

Maintainability program plan (MPP) (required). — This doc- 
ument defines the overall maintainability program, activities, 
documents to be generated, responsibilities, interfaces with the 
logistics function, and the general approach to the analysis of 
maintenance. 


Maintenance concept document (MCD) (required). This
document defines the proposed way maintenance is to be
performed on the product (see fig. 11-11); gives details of the
aims of the maintenance program and support locations;
describes the way all maintenance activities are to be carried out
(details of support and logistics may additionally be specified
depending on document requirements); and defines the input and
output data requirements and the scheduling of maintenance
activities. It includes the following sections:

Mission profile/system operational availability: How often
and over what period of time is the system operational? What
is the geographic deployment of the system and where is the
location of the system that needs to be repaired?

System-level maintainability requirements: What are the
allocated and actual reliability requirements and maintainability
requirements (MTTR, MTBF, MLDT, MDT; see footnote 2)?

Design requirements: What constitutes a maintainable element
that can be removed or replaced (e.g., an orbital replaceable
unit (ORU) or a line replaceable unit (LRU))? What are
the size and weight limits?


2 MTTR, mean time to repair; MTBF, mean time between failures; MLDT,
mean logistic delay time; MDT, maintenance downtime.

Figure 11-11. Factors affecting maintainability.


Diagnostic principles and concepts : How will a failure be 
detected and isolated? How will repairs be evaluated? 

Requirements for suppliers : What information about parts 
and components must the supplier give? How will the first-, 
second-, and third-tier suppliers support their products? How 
quickly will they be available and for how long will they be 
available? 

Repair versus replacement policy: How is the decision made 
to repair or replace a unit? If repaired, how is the unit requalified? 

Repair level analysis : Where will different failures be 
repaired? Which repairs will be made on-orbit (or on the flight- 
line)? Which repairs will be made at an intermediate mainte- 
nance facility (depot) and which will be made at the factory? 

Tools and test equipment : What diagnostic, alignment, and 
check-out tools will be required for each level of maintenance 
(repair)? 

Personnel and training: What is the level of training required 
for the units at each level of maintenance (from simple remove 
and replace to detailed troubleshooting of an ORU/LRU)? 

Crew considerations: What time will be allocated for preven- 
tive and corrective maintenance? How much time can a flight 
crew and a ground crew give to maintenance during or between 
missions? 

Sparing concepts: Which spares will be onboard versus those 
delivered when needed? Will failed units be repaired or 
replaced? What are the general repair policies? 

Elements of logistic support (optional): Where will all the 
test and ground support equipment and inventory control sup- 
plies be located? 


Maintenance plan (MP) (required).— This document 
defines the actual way maintenance is to be performed on the 
product. The MP gives detailed requirements for repair or 
replacement analysis, the location for and levels of mainte- 
nance, and other detailed requirements for performing the 
maintenance. 


Maintainability design guidelines (MDG) (optional). — This
guideline contains suggestions, checklists, and descriptions of
ways to make the design maintainable. Related safety and
human factors, and factors to consider for vendors and
transportation, may also be addressed.


Maintainability requirements document (MRD) 
(required). — This document gives the specific requirements 
(criteria) that will facilitate maintenance or repair in the predicted 
environment. It contains all maintainability requirements. 

Maintainability analysis plan (MAP) (required). — The 
maintainability analysis plan specifies how the maintainability 
of the system is assessed. It also documents the process that 
translates system operational and support requirements into 
detailed quantitative and qualitative maintainability require- 
ments with the associated hardware design criteria and support 
requirements and provides basic analysis information on each 
ORU/LRU. This document includes evaluation processes for 
preventive, corrective, and emergency maintenance. The MAP
documents the formal procedure for evaluating system and
equipment design, 3 using prediction techniques, failure modes
and effects analysis, procedures, and design data to evolve a
comprehensive, quantitative description of maintainability
design status, problem areas, and corrective action
requirements.

Supplier maintainability analysis plan (optional). This
document outlines the methodology to evaluate suppliers for
conformance to maintainability standards.

Maintenance analysis document (required).— This docu- 
ment provides the details of how each ORU/LRU is to be 
maintained and includes detailed maintenance tasks, mainte- 
nance task requirements, and maintenance support require- 
ments. 


Maintainability demonstration plan (optional).— This plan 
documents the process that translates (and verifies) system 
operational and support requirements into actual test plans for 
the maintainability of systems and subsystems. The output, the 
maintainability demonstration report, includes MTTR’s and 
maintenance descriptions (ref. 4). 


3 To help the reader distinguish between the various aspects of maintainability
evaluation, the following is useful. The three stages of the overall evaluation
process are (1) engineering design analysis, (2) maintainability analysis, and
(3) the maintainability demonstration. Engineering design analysis includes
the initial trade studies and evaluation to determine the optimum ORU design
configuration. Also identified are safety hazards, reaction time constraints for
critical maintenance, and an evaluation of diagnostic alternatives.
Maintainability analysis includes an expanded detailed analysis of the final design to
determine all maintainability system parameters. The maintainability
demonstration then specifies tests to verify the data collected during the
maintainability analysis.
Figure 11-12.— Maintenance of limited-life items.


Maintainability Analysis Mathematics 

As previously stated, the goal of system performance is to
have the system available when it is needed. As figure 11-11
shows, the failure rate, the mean time to repair, the time
to acquire spares, and operational constraints all affect
availability.

Availability requirements can be met with an extremely 
reliable system, one that is easy to repair and has an adequate 
supply of spare parts, or a combination of both. System use and 
mission profile also affect system availability requirements. 
The following list gives examples of continuous and intermit- 
tent mission requirements (ref. 5). 

Is continuous operation required as for a critical life support 
system on a space station or an air traffic control system? If so, 
the reliability has to be very high and/or backup systems may 
be needed: 

• Continuous operation
  ° Spacecraft (LEO)
  ° Space station
  ° Air traffic control system

• Intermittent operation (on demand)
  ° Emergency vehicle
  ° Research fighter
  ° Shipboard Gatling gun

• Intermittent operation (scheduled)
  ° Space experiment
  ° CAT scan or MRI equipment in hospital
  ° Space Shuttle main engines


An intermittent operation requirement is different. If avail- 
ability is on demand, the built-in-test/built-in-test-equipment 
(BIT/BITE) and preventive maintenance functions have to be 
perfected and evaluated (through accumulating many hours on 
similar units). However, downtime for preventive maintenance 
has to be accounted for with spare systems. If there is scheduled 
intermittent operation, critical components can be replaced or 
continuously monitored (ref. 6). 

For the mathematical analysis that follows, we will assume that 
we have a system that requires continuous operation except for 
scheduled preventive maintenance, that a temporary backup 
system exists, or that the system can be down for short periods. 
Once the system is put into operation, it might experience 
periods when not all features are operating but the failures can 
be tolerated until the next scheduled preventive maintenance 
(e.g., failure of a monitoring sensor or a BIT/BITE function). 

Maintenance includes (1) corrective maintenance, the
replacement of failed components or ORU’s and LRU’s; (2)
preventive maintenance, 4 scheduled maintenance identified in
the design phase as solution, alignment, calibration, or
replacement of wear items such as clutches, seals, or belts; and
(3) replacement of life-limited items such as those illustrated in
figure 11-12. Distinctions must be made between the availability
calculated from the MTBF that is only valid in region II and the
availability once a component enters its wearout region. Here
the failure rate may increase exponentially, and it is more
difficult to predict. The generally accepted practice is to replace
life-limited items before they enter their wearout period. If the
mission life extends into region III (wearout), the part is a
life-limited component and will be replaced before the beginning of
the wearout stage at time t2. If the mission life is somewhere in
region II, the component will only be replaced if it fails
randomly. No replacement time will be scheduled.

4 Preventive maintenance can also include software. Fixing corrupted tables,
updating data bases, and loading revisions of software are an important part of
scheduled maintenance.

Availability can be calculated as the ratio of operating time 
to total time, where the denominator, total time, can be divided 
into operation time (uptime) and downtime. System availabil- 
ity depends on any factor that contributes to downtime. Under- 
pinning system availability, then, are the reliability and 
maintainability of the system design; however, support factors, 
particularly logistics delay time, also play a critical role espe- 
cially when a long supply line exists (such as with the Interna- 
tional Space Station (ISS)). Assuming these factors remain the 
same, the following availability figures of merit can be 
calculated: 


$$\text{Inherent availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$$

where MTBF is the mean time between failures and MTTR is 
the mean time to repair. Inherent availability considers only 
maintenance of failed units. 


$$\text{Achieved availability} = \frac{\mathrm{MTTMA}}{\mathrm{MTTMA} + \mathrm{MMT}}$$


where MTTMA is the mean time to a maintenance action 
(corrective, preventive, and replacement of limited-life items) 
and MMT is the mean (active) maintenance time (corrective, 
preventive, and replacement of limited-life items). Achieved 
availability includes inherent availability plus consideration 
for time spent for preventive maintenance and maintenance of 
life-limited items. 

$$\text{Operational availability} = \frac{\mathrm{MTTMA}}{\mathrm{MTTMA} + \mathrm{MMT} + \mathrm{MLDT} + \mathrm{MADT}}$$

where MLDT is the mean logistics delay time (includes downtime
due to waiting for spares, equipment, or supplies).
Maintenance downtime is the time spent waiting for
a spare part to become available or waiting for test equipment,
transportation, or a facility area to perform maintenance.
For this discussion, it does not include local delivery such as
going to a local storage location and returning to the work site,
or returning the used part to a location for transport to a repair
facility. MADT is the mean administrative delay time and
includes downtime due to administrative delays, waiting for
maintenance personnel, time when maintenance is delayed because
personnel are assigned elsewhere, and time spent filling out forms
and signing out the part. Operational availability includes achieved
availability plus consideration for all delay times.
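
The three figures of merit differ only in which downtime terms enter the denominator. A minimal Python sketch of that nesting follows; the function names and the numeric inputs are illustrative assumptions, not values from the text.

```python
def inherent_availability(mtbf_hr, mttr_hr):
    """Counts only corrective repair of failed units as downtime."""
    return mtbf_hr / (mtbf_hr + mttr_hr)


def achieved_availability(mttma_hr, mmt_hr):
    """Adds preventive and life-limited maintenance through MTTMA and MMT."""
    return mttma_hr / (mttma_hr + mmt_hr)


def operational_availability(mttma_hr, mmt_hr, mldt_hr, madt_hr):
    """Adds logistics (MLDT) and administrative (MADT) delay times."""
    return mttma_hr / (mttma_hr + mmt_hr + mldt_hr + madt_hr)


# Hypothetical inputs in hours, chosen only to show the ordering.
a_i = inherent_availability(mtbf_hr=1000.0, mttr_hr=4.0)
a_a = achieved_availability(mttma_hr=400.0, mmt_hr=3.0)
a_o = operational_availability(mttma_hr=400.0, mmt_hr=3.0,
                               mldt_hr=24.0, madt_hr=2.0)
print(f"inherent={a_i:.4f} achieved={a_a:.4f} operational={a_o:.4f}")
```

With the same MTTMA and MMT, adding any delay time can only lower the result, which is why operational availability is the most conservative of the three.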

Availability measures can also be calculated for a point in
time or for an average over a period of time. Availability can
also be evaluated for a degraded system. For the remainder of
our discussion, we will assume average availability and
maintainability factors.

Figure 11-13.— Maintainability during system operation.

Other important factors in calculating availability include (1)
maximum allowable time to restore, (2) proportions of faults
and percentage of time detected as a function of failure mode,
(3) maximum false alarm rate for built-in test equipment, and
(4) maximum allowable crew time for maintenance activities.

We also want to look in detail at an individual corrective
maintenance action. A number of elements make up a maintenance
action and once they are combined, other factors must be
considered before the overall impact on crew hours, maintenance
hours, and other maintenance parameters is determined
(fig. 11-13). These elements are (ref. 7)

(1) Maintainability prediction using the most effective methods
available emphasizes an estimation of the time to restore at
the ORU/LRU level. For a failed unit, the time to restore is the
total corrective maintenance time T in minutes for each ORU:

T = DI + DL + GA + RR + SR + CK + CU 

where 

DI diagnostic time to detect and isolate a fault to the ORU 
level, min 

DL local delivery of spare ORU/LRU as opposed to shipping 
in from a remote location, min 
GA time required to gain access to the failed ORU, min 
RR time required to remove and replace the defective ORU, min 
SR time required to restore system (including alignment, 
checkout, and calibration), min 
CK time required to complete system checkout, min 
CU time required to close up system, min 

(2) The mean time to repair (MTTR) the ORU (on-orbit) 
follows. For this exercise, assume a crew size of one for all 
repair operations: 


$$\mathrm{MTTR}_{ORU} = \frac{T \times Z}{60}$$

where MTTR is in hours and Z is the conversion factor for
1 to 10^-6 g.

(3) The mean time to a maintenance action (MTTMA) based
on a yearly average is

$$\mathrm{MTTMA} = \frac{\mathrm{MMHY}_c + \mathrm{MMHY}_p + \mathrm{MMHY}_l}{8640}$$

where MMHY is the maintenance hours per year,
the subscripts c, p, and l denote corrective, preventive, and
life-limited replacement, respectively, and 8640 is the number
of hours in one year.

(4) The maintenance hours per year (MMHY) for corrective
(c), preventive (p), and life-limited replacement (l) follow:

„ ( 8640 'j 

MMHY C . = DC x MTTRqru * * . * ) 

MMHYp = MMP x F(P) 

MMHY MmQRU 

T, 


where 

DC     duty cycle of ORU, percent
MTBF   mean time between failures, hr
MTBM   mean time between maintenance, hr
MMP    mean hours to perform preventive task, hr
F(P)   preventive task frequency per year
K      MTBF to MTBM conversion factor
T_l    life limit for ORU, hr

(5) Maximum corrective maintenance time M_max is the +90
percent time for a normal distribution. It is assumed that since
this is a manual operation and not the subject of wearout, the
normal distribution will apply:

$$M_{max} = \mathrm{MTTR}_{ORU} + (1.61 \times \sigma)$$

where σ is the standard deviation of the repair time.
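
Elements (1) and (5) above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the class and function names are hypothetical, the task times are invented, and the factor 1.61 is simply the multiplier quoted in the text.

```python
from dataclasses import dataclass, astuple


@dataclass
class CorrectiveTaskMinutes:
    """Time elements of one corrective action, all in minutes."""
    di: float  # detect and isolate the fault to the ORU level
    dl: float  # local delivery of the spare ORU/LRU
    ga: float  # gain access to the failed ORU
    rr: float  # remove and replace the defective ORU
    sr: float  # restore system (alignment, checkout, calibration)
    ck: float  # complete system checkout
    cu: float  # close up system

    def total(self) -> float:
        """T = DI + DL + GA + RR + SR + CK + CU."""
        return sum(astuple(self))


def max_corrective_time(mttr_hr, sigma_hr, factor=1.61):
    """M_max = MTTR + factor * sigma, using the factor quoted in the text."""
    return mttr_hr + factor * sigma_hr


# Hypothetical task times (minutes) and repair-time spread (hours).
task = CorrectiveTaskMinutes(di=10, dl=5, ga=15, rr=20, sr=10, ck=5, cu=5)
print("T =", task.total(), "min")
print("M_max =", max_corrective_time(mttr_hr=2.0, sigma_hr=0.5), "hr")
```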

Plots of typical inherent availability are presented in
figure 11-14 as a function of MTTR and MTBF. Here, solving the
expression

$$\text{Inherent availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$$

for MTTR gives

$$\mathrm{MTTR} = \frac{(1 - \text{inherent availability}) \times \mathrm{MTBF}}{\text{inherent availability}}$$

Figure 11-15 shows MTTR as a function of failure rate (assuming
an exponential rate). For an exponential distribution, the
failure rate λ is 1/MTBF. Substituting this into the above
expression for inherent availability and solving for MTTR
yields the results shown.

Figure 11-14.— Relationship of MTTR and MTBF to availability.

Figure 11-15.— Relationship of MTTR and failure rate to
availability.
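
The curves in these figures can be regenerated from the inherent availability expression. Solving A = MTBF/(MTBF + MTTR) exactly for MTTR gives MTTR = MTBF(1 - A)/A, which for A close to 1 is nearly (1 - A) × MTBF. The Python sketch below assumes that exact form; the function names are illustrative.

```python
def mttr_for_availability(availability, mtbf_hr):
    """Largest MTTR (hr) that still meets the inherent availability target."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    return mtbf_hr * (1.0 - availability) / availability


def mttr_from_failure_rate(availability, failures_per_1e6_hr):
    """Same relation with MTBF = 1/lambda (exponential failure rate)."""
    mtbf_hr = 1.0e6 / failures_per_1e6_hr
    return mttr_for_availability(availability, mtbf_hr)


# A 0.990 availability target with a 300-hr MTBF allows about 3 hr of repair.
print(round(mttr_for_availability(0.990, 300.0), 2))
```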


Additional Considerations 

As previously mentioned, to speed the system back into 
operation, it is typically divided into units (ORU’s/LRU’s) that 
can be easily replaced, either on-orbit or on the flight line. This 
means that maintenance on-orbit (or on the flight line) will 
usually only replace these items. The following are important 
questions we need to ask for our maintainability analysis 
(ref. 8): 

• How much downtime is acceptable? 

• What will be replaced on the flight line (what should be 
designated an LRU or an ORU)? 

• How will a failure be diagnosed and isolated to an ORU/LRU:
by BIT/BITE, manual processes, software, or a combination?

• Will the failed units be scrapped or repaired? 

• If repaired, what should be repaired for each type of failure? 
Where should it be repaired (depot, lab, factory) and by what 
skill level? 

• What preventive maintenance needs to be performed? 

• What kind of maintenance tests need to be performed? 

• Can all components be inspected for structural defects? 

• How will structural defects be detected and tracked? 

• Have acceptable damage limits been specified? 

• Are safety-related components easy to replace? 

• Are there safety issues that occur during maintenance? 

• How is corrosion controlled? 

• Are limited-life items tracked for maintenance? 


A combination of built-in testing and diagnostic procedures 
(with the needed tools and instruments) must be available to 
diagnose a fault or failure to at least one ORU/LRU level. If it 
cannot be determined with that fidelity, the wrong item might 
be replaced. The built-in test procedures begin with specific 
questions: 

• Do we know what is going to fail?
  ° Do maintenance records allow preventive maintenance
    where critical items are replaced at a known percentage of
    life?
  ° Do smart diagnostic features sense impending failures?

• Do we know what has failed?
  ° Does built-in test equipment quickly diagnose
    the problems?
  ° Does readily available external test equipment quickly
    diagnose the problems?

• Do we know how we are going to handle each failure?
  ° Has a repair analysis been performed on all likely failures?
    How will each failure be diagnosed and repaired?
  ° Has the failure modes and effects analysis (FMEA) been
    evaluated for failures and corrective actions?

The questions that remain are Can all plausible and probable 
failure modes (based on the FMEA/CIL) be diagnosed with 
BIT/BITE? and Can the necessary diagnostic procedures be 
carried out by a crew member or technician on the flight line? 
The answers to these questions determine the design concept 
for maintainability. The aim of this analysis is to reduce 
downtime. 


Requirements and Maintainability Guidelines for ORU’s 

Other requirements to evaluate ORU’s/LRU’s follow. 

(1) On-orbit replacements of ORU’s should not require 
calibrations, alignments, or adjustments. Replacements of like 
items in ORU’s should be made without adjustments or align- 
ments (this will minimize maintenance time). 

(2) Items that have different functional properties should be
identifiable and distinguishable and should not be physically
interchangeable. Provisions should be incorporated to preclude
installation of the wrong (but physically similar) cards,
components, cables, or ORU’s with different internal components,
engineering revision number, and so forth. Reprogramming,
changing firmware, and changing internal switch settings may
be allowed with special procedures and safeguards.

(3) All replaceable items should be designed so that it will be 
physically impossible to insert them incorrectly. This is a basic 
maintainability and safety requirement. 
Additional maintainability considerations that should be
incorporated in the design are

(1) Any ORU, shop replaceable unit (SRU) 5 , their subcom- 
ponents, or cards that are physically identical should be inter- 
changeable (excluding cables and connectors). Identical 
hardware (e.g., a signal conditioning card) shall not be made 
unique. Different software and switch settings do not affect 
identity. The ability to replace ORU’s with an identical unit 
from an inactive rack will improve availability. 

(2) Standardization should be incorporated to the maximum 
extent through the design. In the interest of developing an 
efficient supply support capability and in attaining the avail- 
ability goals, the number of different types of spares should be 
held to a minimum. 

(3) The ORU should be designed from standard off-the-shelf 
components and parts. 

(4) The same items and/or parts should be used in similar
ORU’s with similar applications (e.g., boards, fasteners, switches,
and other human interface items; fuses, cable color designations,
and connectors (except to avoid improper hook-ups)).

(5) Equipment control panel positions and layouts (from 
panel to panel) should be the same or similar when a number of 
panels are incorporated and provide comparable functions. 


Related Techniques and Disciplines 

Some disciplines that relate to basic maintainability analysis 
are now discussed (ref. 9). 

Supportability — This is a global term that covers all main- 
tenance and logistics activities. The unit can be supported if it 
can be maintained and if spare parts can be delivered to it. 

Reliability-centered maintenance (RCM) — This mainte- 
nance process is based on the identification of safety-critical 
failure modes and deterioration mechanisms through engineer- 
ing analyses and experience. Thus, the consequences of the 
failure can be determined on the basis of severity level so that 
maintenance tasks can be allocated according to severity level 
and risk. The RCM logic process considers maintenance tasks
relative to (1) hard-time replacements in which degradation
because of age or usage is prevented and maintenance is at
predetermined intervals; (2) on-condition maintenance in which
degradation is detected by periodic inspections; and (3)
conditional maintenance in which degradation prior to failure is
detected by instrumentation and/or measurements.

Integrated logistics support. — This includes the distribution,
maintenance, and support functions for systems and
products: (1) maintenance, (2) supportability, (3) test and
support equipment, (4) personnel training, (5) operations
facilities, (6) data (manuals), (7) computer resources (for
maintenance of equipment and software), and (8) disposal. Personnel
considerations involve analyzing what level of expertise is
needed for each level of maintenance (on the flight line, in a
depot (intermediate repair facility), or in the factory) to
effectively perform the repairs.

5 A part or component that is designed and/or designated to be replaced in a
depot or at the manufacturer. For instance, it may be highly modular but its
failure cannot be easily detected on-orbit or on the flight line.

Figure 11-16.— Effect of quality on maintainability.

Maintainability, quality, and reliability. — Figure 11-16
shows the relationship between the three. As quality and
manufacturing techniques improve, reliability increases. Therefore,
for the same availability, a longer MTTR can be tolerated, or for
the same MTTR, a higher availability may be attained. The
reliability of the product is given by R_product, where the
design-stage reliability R_D is modified by various K factors.
These denote probabilities that the design-stage reliability will
not be degraded by any given factor.
The K factors are external contributors to product failure:

$$R_{product} = R_D \, (K_q K_m K_r K_l K_u)$$

where

K_m   manufacturing, fabrication, assembly techniques
K_q   quality test methods and acceptance criteria
K_r   reliability fault control activities
K_l   logistics activities
K_u   user or customer activities

Manufacturing processes or assembly techniques that are not 
statistically controlled can greatly affect reliability. Special 
cause variation, change in raw materials, or lack of adherence 
to manufacturing procedures can dramatically reduce product 
reliability. Poor test methods may allow substandard compo- 
nents to be used in a product that would fail final test screenings 
and enter the operating population. Poor packing, shipping 
practices, storage, and so on will raise the failure rate. The user 
or customer may abuse the product by using it for what it was 
not intended or using it in a new unspecified environment. All 
these problems require that the system be maintainable during 
operation. 
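
Because each K factor is a probability that the design-stage reliability is not degraded, the product reliability is a plain product of terms. A minimal Python sketch follows; the function name and every numeric value are hypothetical, chosen only to show how quickly modest degradations compound.

```python
import math


def product_reliability(r_design, k_factors):
    """R_product = R_D multiplied by all K factors (each between 0 and 1)."""
    for name, k in k_factors.items():
        if not 0.0 <= k <= 1.0:
            raise ValueError(f"K factor {name!r} must be a probability")
    return r_design * math.prod(k_factors.values())


# Hypothetical K factors: manufacturing, quality, reliability control,
# logistics, and user activities.
k = {"m": 0.99, "q": 0.98, "r": 0.995, "l": 0.99, "u": 0.97}
print(round(product_reliability(0.999, k), 4))
```

Even with every factor at 0.97 or better, the fielded reliability in this illustration falls well below the design value, which is the argument the text makes for keeping the system maintainable during operation.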
TABLE 11-2.— MAINTAINABILITY FIGURES OF MERIT
Weight of orbital replacement and line replacement units, kg
Volume, m^3
Power requirement, W
Definition of partial operation
Mean time between failures, hr
Life and wearout, hr
Mean time to repair, hr/repair
Failure modes and effects analysis, hr
Manifest time, hr
Operation time, hr
Operation period, hr
Spare location, sec
Maintenance cost, dollars
Repair cost, dollars
Transportation, dollars
Built-in test capabilities
Tools required
Preventive maintenance
Supportability
Availability


Maintainability Problems 


The maintainability, reliability, and cost data items in
table 11-2 represent the information required to perform a
maintainability analysis. We will consider how these items
interact and how maintainability trades can be made. First,
consider examples 1 and 2 (for the basic formulas, refer to the
section Maintainability Analysis Mathematics).


Example 1

Five pressure transducers (model c-4) were tested and failed
after an average of 2257 hr (time for first failure t_f). Time studies
have shown that it takes 5.5 hr to diagnose, remove, replace, and
check out a unit (MTTR). Assuming continuous use and an
exponential failure rate, what is the failure rate λ, the reliability
for a mission time t_m of 50 hr, the MTBF, and the availability?
First, determine the failure rate:

$$\lambda = \frac{1}{\mathrm{MTBF}} = \frac{1}{2257} = 0.000443\ \text{failures/hr, or 443 failures}/10^{6}\ \text{hr}$$

Determine the reliability:

$$\text{Reliability} = \exp(-\lambda t_m) = \exp(-0.000443 \times 50) = 0.9780$$

Determine the availability:

$$\text{Availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} = \frac{2257}{2257 + 5.5} = 0.9976$$


Example 2

Five RTD temperature sensors (model RTD-A-7) were tested
and failed after an average of 4026 hr (time for first failure t_f).
Time studies have shown that it takes 52 hr to diagnose, remove,
order, receive, replace, and check out a unit (MTTR). Assuming
continuous use and an exponential failure rate, what is the
failure rate λ, the reliability for a mission time t_m of 50 hr, the
MTBF, and the availability? Determine the failure rate:

$$\lambda = \frac{1}{\mathrm{MTBF}} = \frac{1}{4026} = 0.000248\ \text{failures/hr}$$

The reliability is

$$\text{Reliability} = \exp(-\lambda t_m) = \exp(-0.000248 \times 50) = 0.9876$$

The availability is

$$\text{Availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} = \frac{4026}{4026 + 52} = 0.9872$$
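
The arithmetic of examples 1 and 2 can be checked with a short Python sketch. It assumes, as the examples do, that the average time to first failure is taken as the MTBF and that the failure rate is exponential; the function name is illustrative.

```python
import math


def exponential_metrics(mtbf_hr, mttr_hr, mission_hr):
    """Failure rate, mission reliability, and inherent availability."""
    failure_rate = 1.0 / mtbf_hr                  # failures per hour
    reliability = math.exp(-failure_rate * mission_hr)
    availability = mtbf_hr / (mtbf_hr + mttr_hr)
    return failure_rate, reliability, availability


# Example 1: pressure transducers; example 2: RTD temperature sensors.
example_1 = exponential_metrics(mtbf_hr=2257.0, mttr_hr=5.5, mission_hr=50.0)
example_2 = exponential_metrics(mtbf_hr=4026.0, mttr_hr=52.0, mission_hr=50.0)

for label, (lam, rel, avail) in (("1", example_1), ("2", example_2)):
    print(f"Example {label}: lambda={lam * 1e6:.0f} per 10^6 hr, "
          f"R={rel:.4f}, A={avail:.4f}")
```

The sensor in example 2 is the more reliable part, yet its availability is lower because the 52-hr MTTR includes ordering and receiving a replacement.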


Problem Solving Strategy 

One way to assess tradeoffs is to first evaluate conformance
to minimum maintainability requirements and then to calculate
the effects that the alternatives have on costs by following these
steps: (1) determine screens, minimum or maximum
acceptable values for a system or component; (2) determine
which tradeoffs meet these screens; (3) of the systems that pass,
calculate costs (cost of spare, cost to ship spare, cost to install
spare); (4) determine the lowest cost system; and (5) examine
the results for reasonableness.

TABLE 11-3

(6) System operating time per week, 4 hr
(7) Maximum resource allocation for maintenance, 0.1 hr/wk
(8) Operational requirement, 6 hr/wk
(9) Total mission time, 87 360 hr
(10) Total system operation, 2080.0 hr/yr
(12) Transportation of board, $4500
(13) Maintenance on-orbit, $500/hr

How long does it take to deliver a replacement part from the
warehouse or factory (for the total mission, turnaround time for
repair of boards also needs to be considered)? What is the ADT?
How long will it take to process an order for spares and how long
will it take to do other paperwork? (ADT may not affect system
availability but it will affect total crew maintenance time used to
repair the system.)

What is the total time that the unit will be in the system and
available for operation?

How many hours per week does the unit operate and in what modes
(operational, standby, partial, off)?

Are crews available for maintenance and operation of the unit? Is
the MTTR reasonable so that the crew will have time to do
maintenance?

Are there limits on how long an item can take to be repaired?
(Often, if a system is difficult to repair, it may be neglected in
favor of a more easily maintained system.)

What are the total clock hours the mission is to last (irrespective of
whether the system being considered is operating)?

What is the cost to transport a spare board to the site of field
repairs? (If the site is remote or on-orbit, the cost may be
considerable.)

What are the allocated costs for crew maintenance time on-site or
on-orbit? (The cost of crew maintenance time may be considerable
and significantly affect the overall trade study costs.)

This discussion presents a more detailed analysis of how 
tradeoffs (at the board or component level) involving mainte- 
nance and reliability may be made. This is a more complex 
example for which we want to determine the lowest cost 
solution to a maintainability problem with fixed requirements 
by following the above procedures. 

Determining screening requirements. — The reliability and
maintainability screening requirements must be determined. Here
there is a maximum MTTR 6 related to maintenance crew
availability, a minimum MTBF due to mission restrictions, and a
specified availability requirement needed to complete the
mission. The operation of the system is intermittent. A detailed list
of these requirements and costs is presented in table 11-3, which
gives quantitative system data needed to evaluate the model.

The availability, maintainability, and reliability screens in
table 11-3 are also portrayed graphically in figure 11-17 where
availability is shown as a function of MTBF and MTTR. The
solution space described by the system and mission requirements
is bounded by the 0.990 availability line, the MTBF
minimum of 300 hr, and the MTTR maximum of 5 hr. Note also
that in this figure the constant availability lines are generated
with MTBF’s and MTTR’s that represent average values:
MTTR and MTBF are usually considered distributed variables
with an exponential or normal distribution.

Having addressed the basic requirements imposed on the 
system and the costs associated with a maintenance action, we 
will now evaluate individual boards that are being considered 
for a black box in the system. 

First, some additional assumptions must be made. (1) Only 
one spare board is required and it is readily accessible on-orbit 
or on the flight line; (2) all spares cost the same; (3) there is no 
finance (carrying) cost; and (4) repair costs for each alternative 
board are the same. 7 

6 Strictly speaking, we do not have a “maximum MTTR” since MTTR and
also MTBF do not have distributions but are derived from a distribution. This
notation is kept because we are looking at a number of MTTR's for various
alternative boards and the like.

7 A problem arises when the boards are stored on the ground or in a
warehouse (for LRU's) when there are long logistic delay times. If systems were
in remote sites or on-orbit (with no local storage of spares) with only three or
four deliveries of spares per year (as with the space shuttle), there might be
considerable periods of downtime.
Figure 11-17.— Problem solution area on availability plot.


Determining tradeoffs that meet screens. — Data required to
evaluate each potential electronic board for a particular
function in the system are given in table 11-4. Board option 1 was
discarded for failure to meet functional design parameters.
Each remaining board (first column) was evaluated for
expected MTBF or reliability (with a parts count according to
MIL-HDBK-217 or possibly via testing), estimated cost to
purchase the board, estimated time to repair the board (based on
ease of diagnosis, built-in test circuitry or software), and
estimated LDT (based on the supplier turnaround history) and
administrative delay time (ADT).

The next step is to calculate the data required in table 11-5 to
see if the maintainability and reliability requirements have been
met.

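$$\text{Number of maintenance actions} = \frac{\text{Total mission time (wk)} \times \text{system operating time (hr/wk)}}{\text{Mean time between failures (hr)}}$$

$$\text{Availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$$

$$\text{Total maintenance time (hr)} = \text{Number of maintenance actions per mission} \times (\mathrm{MTTR} + \mathrm{LDT} + \mathrm{ADT})$$

$$\text{Total maintenance time (hr/wk)} = \frac{\text{Total maintenance time (hr)}}{\text{Total mission time (wk)}}$$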

Note that the maintainability screens are independent and may
not necessarily relate to these formulas (e.g., irrespective of the
required availability and minimum MTBF, there may be a
maximum maintenance time allowed). After evaluating the
results, we found that options 2, 3, 4, and 5 failed the minimum
MTBF and availability screens; option 4a failed the maximum
MTTR screen; the remaining options 2a and 3a will be
evaluated to determine which has the lower cost.

8 The formula for column F is F = (5)(6)/B where (5) and (6) refer to items
in table 11-3 and B refers to column B in table 11-4.


TABLE 11-4.— BOARD TRADEOFF OPTION DATA
[Logistics and administrative delay times, LDT + ADT, 0.3 hr.]

A          B                  C            D
Board      Mean time          Cost,        Mean time
option     between            dollars      to repair,
           failures,                       MTTR,
           MTBF, hr                        hr

1 (a)      --                 --           --
2          195                 74 100      3.7
2a         662                182 900      3.8
3          191                 77 600      3.5
3a         583                130 800      3.7
4          199                 76 600      3.3
4a         828                188 257      6.8
5           62                 45 400      3.4

a Discarded for failing to meet functional design parameters.
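
The screening step can be sketched in Python using the table 11-4 data and the screens quoted in the text (minimum MTBF of 300 hr, maximum MTTR of 5 hr, minimum availability of 0.990). Two assumptions are labeled in the code: the mission length in weeks is derived from the 87 360-hr total mission time, and the MTTR for option 4 is read as 3.3 hr. The names are illustrative.

```python
# Assumption: 87 360 hr total mission time / 168 hr per week = 520 wk.
MISSION_WEEKS = 87360 / 168
OPERATING_HR_PER_WEEK = 4.0      # system operating time per week
DELAY_HR = 0.3                   # LDT + ADT from table 11-4

MIN_MTBF_HR = 300.0
MAX_MTTR_HR = 5.0
MIN_AVAILABILITY = 0.990

# option: (MTBF hr, MTTR hr); option 4 MTTR read as 3.3 hr (assumption).
BOARDS = {
    "2": (195.0, 3.7), "2a": (662.0, 3.8), "3": (191.0, 3.5),
    "3a": (583.0, 3.7), "4": (199.0, 3.3), "4a": (828.0, 6.8),
    "5": (62.0, 3.4),
}


def evaluate(mtbf_hr, mttr_hr):
    """Columns F to I of the tradeoff calculation for one board."""
    actions = MISSION_WEEKS * OPERATING_HR_PER_WEEK / mtbf_hr
    availability = mtbf_hr / (mtbf_hr + mttr_hr)
    total_hr = actions * (mttr_hr + DELAY_HR)
    return actions, availability, total_hr, total_hr / MISSION_WEEKS


def passes_screens(mtbf_hr, mttr_hr):
    availability = evaluate(mtbf_hr, mttr_hr)[1]
    return (mtbf_hr >= MIN_MTBF_HR and mttr_hr <= MAX_MTTR_HR
            and availability >= MIN_AVAILABILITY)


passing = [name for name, (mtbf, mttr) in BOARDS.items()
           if passes_screens(mtbf, mttr)]
print(passing)
```

Under these assumptions only options 2a and 3a survive all three screens.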

Determining the cost of acceptable systems. — Of the
systems that pass, calculate the costs of purchasing the spare and
the board, repairing the failed unit, and shipping and installing
the spare. These figures are shown in table 11-6.

The total mission board repair cost is equal to the cost of 
repairing each board (at a depot or the factory) times the total 
number of maintenance actions. The cost of the board repair is 
$7000/repair, which would theoretically be reduced by the 
number of spares purchased. The repair cost and turnaround 
time should be part of the supplier’s bid for the board. 

The total mission board shipping cost is equal to the cost of 
transporting the board times the total number of maintenance 
actions. The cost of shipping the board is $4500 per shipment. 

The total mission board maintenance cost reflects the cost of 
changing out the board on-orbit or on the flight line. The cost to 
replace the board is $500 per hr, which assumes that the board is 
also an ORU or an LRU. The total is this hourly cost times the 
total maintenance time, that is, the total number of maintenance 
actions times (MTTR + LDT + ADT). 

The total mission repair cost (column M of table 11-6) is equal to 
the sum of the board repair, shipping, and maintenance costs. 

The total mission board cost is equal to the total mission 
repair cost plus the cost of the board and one spare board. 
The cost of manufacturing the board was already given in 
column C of table 11-4. For the present example, we will 
assume that we need to purchase one board and one spare 
board.⁹ 
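The cost roll-up described above can be sketched as follows. The unit costs ($7000 per repair, $4500 per shipment, $500 per maintenance hour) come from the text; the function name, argument names, and the sample inputs are hypothetical and are not taken from the tables.

```python
REPAIR_COST_PER_ACTION = 7000.0    # dollars per repair (from the text)
SHIPPING_COST_PER_ACTION = 4500.0  # dollars per shipment (from the text)
REPLACEMENT_COST_PER_HR = 500.0    # dollars per maintenance hour (from the text)


def mission_board_cost(actions, maintenance_hr, board_cost, boards_purchased=2):
    """Roll up the mission cost for one board option.

    actions          -- number of maintenance actions per mission
    maintenance_hr   -- total maintenance time, actions x (MTTR + LDT + ADT)
    board_cost       -- purchase cost of a single board, dollars
    boards_purchased -- one board plus one spare by default
    """
    repair = REPAIR_COST_PER_ACTION * actions
    shipping = SHIPPING_COST_PER_ACTION * actions
    maintenance = REPLACEMENT_COST_PER_HR * maintenance_hr
    total_repair = repair + shipping + maintenance
    total = boards_purchased * board_cost + total_repair
    return {
        "repair": repair,
        "shipping": shipping,
        "maintenance": maintenance,
        "total_repair": total_repair,
        "total": total,
    }


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
costs = mission_board_cost(actions=4.0, maintenance_hr=16.0, board_cost=100_000.0)
print(costs["total_repair"], costs["total"])
```

Because the board and its spare are purchased once while repair and shipping scale with the number of maintenance actions, a more expensive board with a higher MTBF can still come out ahead over a long mission.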


⁹One must also consider the quantity of spares needed to have a replacement 
board available at all times. This is a function of the desired probability of an 
available spare and of the time to ship the board out for repairs, to repair it, 
to recertify it, and to return it to a storage location. A detailed discussion of 
the mathematics of this evaluation is beyond the scope of this paper. Additional 
costs will also be incurred with parts storage. 
   A           F                G                 H                  I 
 Board    Number of        Availability,     Total              Total 
 option   maintenance      percent,          maintenance        maintenance 
          actions per      G = B/(B + D)     time,              time, 
          mission,                           hr,                hr/wk, 
          F = (5)(6)/B                       H = F(D + E(a))    I = H/(5) 

 2          10.7                                42.7              0.08 
 2a          3.1             .994               12.9               .02 
 3          10.9             .980               41.4               .08 
 3a          3.6             .993               14.3               .03 
 4          10.5             .982               37.6               .07 
 4a          2.5             .992               17.8               .03 
 5          33.3             .944              123.3               .24 

(a) E = logistics and administrative delay time of 0.3 hr. 

TABLE 11-6.—TOTAL MISSION COST CALCULATIONS 

   A            J                 K                  L                M              N 
 Board    Board repair,     Board shipping,    Board maintenance,  Repair,      Board and spare, 
 option   dollars/mission,  dollars/mission,   dollars/mission,    dollars,     dollars, 
          J = (11)F         K = (12)F          L = (13)H           M = J + K + L  N = (2C) + M 

 2           74 683            48 011              1861             124 554       272 754 
 2a          22 005            14 146              1903              38 055       403 855 
 3           76 216            48 996              1761             126 974       282 174 
 3a          24 965            16 049              1854              42 867       304 467 
 4           73 151            47 026              1660             121 837       275 037 
 4a          17 578            11 300              3403              32 280       408 794 
 5          233 206           149 918              1733             384 857       475 657 

Determining the lowest cost system.—The solution is to pick 
the lowest-cost board that passed the screens. Options 2, 3, 4, 4a, 
and 5 have already failed screens. Of the remaining candidates, 
2a and 3a, option 3a has the lower cost. 
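The selection step can be sketched as a filter followed by a minimum. The total costs below are the column N values from table 11-6, and the set of failed options is the one stated in the text; the variable names are illustrative only.

```python
# Column N of table 11-6: board and spare plus mission repair cost, dollars.
total_cost = {
    "2": 272_754, "2a": 403_855, "3": 282_174, "3a": 304_467,
    "4": 275_037, "4a": 408_794, "5": 475_657,
}

# Options that failed the MTBF, availability, or MTTR screens.
failed_screens = {"2", "3", "4", "4a", "5"}

# Keep only the options that passed every screen, then pick the cheapest.
candidates = {opt: cost for opt, cost in total_cost.items()
              if opt not in failed_screens}
best_option = min(candidates, key=candidates.get)

print(best_option, candidates[best_option])
```

Note that screening comes first: option 2 has the lowest total cost of all, but it is not a candidate because it failed the MTBF and availability screens.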

Examining the results for reasonableness.—As always, 
factors other than costs must be included in the analysis. Human 
factors, hierarchy of repairs, ease of problem diagnosis, ability 
to isolate faults, ability to test the unit, manufacturer’s process 
controls and experience, and the ability of the manufacturer to 
provide long-term support for the unit are some additional 
considerations. 


Recommended Techniques 

Current and future NASA programs face the challenge of 
achieving a high degree of mission success with a minimum 
degree of technical risk. Although technical risk has several 
elements, such as safety, reliability, and performance, a proven 
track record of overall system effectiveness ultimately will be 
the NASA benchmark that will foster the accomplishment of 
mission objectives within cost and schedule expectations with- 
out compromising safety or program risk. A key characteristic 
of system effectiveness is the implementation of appropriate 
levels of maintainability through the program life cycle. 

Maintainability is a process for assuring the ease with which 
a system can be restored to operation following a failure. It is 


an essential consideration for any program requiring ground 
and/or on-orbit maintenance. The Office of Safety and Mission 
Assurance (OSMA) has undertaken a continuous improvement 
initiative to develop a technical roadmap that will provide a 
path to achieving the desired degree of maintainability while 
realizing cost and schedule benefits. Although early life-cycle 
costs are a characteristic of any assurance program, 
operational cost savings and improved system availability 
almost always result from a properly administered maintain- 
ability assurance program. Experience in NASA programs has 
demonstrated the value of an effective maintainability program 
initiated early in the program life cycle. 

Technical Memorandum 4628 entitled “Recommended Tech- 
niques for Effective Maintainability” provides guidance for 
achieving continuous improvement of the life-cycle develop- 
ment process within NASA, having been developed from the 
experiences of NASA, the Department of Defense, and indus- 
try. The degree to which these proven techniques should be 
imposed resides with the project or program and will require an 
objective evaluation of the applicability of each technique. 
However, each applicable suggestion not implemented may 
represent an increase in program risk. Also, the information 
presented is consistent with OSMA policy, which advocates an 
integrated product team (IPT) approach for NASA systems 
acquisition. Therefore, this memorandum should be used to 
communicate technical knowledge that will promote proven 
maintainability design and implementation methods resulting 
in the highest possible degree of mission success while balancing 
cost effectiveness and programmatic risk. The recommended 
techniques can be found online at 
http://www.hq.nasa.gov/office/codeq/doc.pdf. 


Conclusion 

The benefit of a system maintainability program is mission 
success, the goal of every NASA System Reliability and Quality 
Assurance (SR&QA) office.¹⁰,¹¹ A well-planned maintainability 
program gives greater availability at lower costs. A design with 
easily maintained (and assembled) modules results. Considering 
maintenance prevents the inclination to use lower-cost components 
at the expense of reliability unless maintainability tradeoffs 
justify them. Finally, maintainability analysis forces consideration 
of potential obsolescence and the need for upgrades¹² and reduces 
overall maintenance hours and the total cost of ownership. 

¹⁰NASA Glenn Research Center is designing a second-generation 
instrument to measure microgravity on the space station. The operating time 
for the instrument is expected to be 10 yr. Reliability analysis has shown low 
reliability for this mission even if we can get all the components to have an 
MTBF of 40 000 hr. Therefore, we are developing a maintenance program with 
an on-orbit repair time of 700 hr, which should give a suitable availability for 
the mission. 

¹¹NASA Glenn had an interesting experience with one of its space 
instruments. It was designed for a mission time of 18 hr and had a reliability 
greater than 0.90. It was suggested that we use the instrument on MIR for a 
3000-hr mission. The reliability fell to 0.40 when this and other factors were 
considered. Maintainability was factored in with selected spare parts, and 
software was added to perform built-in test (BIT) of the unit. The mission 
specialists were also trained to do repair work. The availability was returned 
to its previously acceptable level (with the previous level of reliability). The 
instrument has successfully collected data on MIR. 

¹²For example, a ruggedized optical disk drive required maintenance after 
each flight on the space shuttle or after 450 hr of operation. This process took 
4 wk, which was unacceptable to NASA when the system had to be placed on 
the Russian Space Station MIR. To correct the problem, the drives were 
replaced with another component that greatly reduced maintenance time. 
Reliability Training¹³ 

1. Three thermostats were tested and failed after an average of 39 500 cycles. 
   Time studies showed that diagnosis took an average ... 6.8 hr to remove, 
   replace, and check out a thermostat. What is the MTBF of the unit for a 
   mission time of 168 cycles? 

   A. 30 200 cycles        B. 35 600 cycles        C. 39 500 cycles 

   What is the failure rate? 

   A. 20.6×10^-6 failure/hr    B. 25.3×10^-6 failure/hr    C. 30.7×10^-6 failure/hr 

   What is the reliability? 

   A. 0.976        B. 0.986        C. 0.996 

   What is the availability? 

   A. 0.979        B. 0.989        C. 0.999 

2. Three air bearings were tested and failed after an average of 323 000 hr. 
   It is estimated that it will take an average of 3200 hr to diagnose, remove, 
   replace, and check out a bearing in low Earth orbit. What is the MTBF of a 
   unit for a mission time of 80 000 hr? 

   A. 293 000 hr/failure    B. 313 000 hr/failure    C. 323 000 hr/failure 

   What is the failure rate? 

   A. 3.1×10^-6 failure/hr    B. 3.5×10^-6 failure/hr    C. 4.0×10^-6 failure/hr 

   What is the reliability? 

   A. 0.68        B. 0.78        C. 0.88 

   What is the availability? 

   A. 0.79        B. 0.89        C. 0.99 

¹³Answers are given at the end of this manual. 
Appendix A 

Reliability Information 


The figures and tables in this appendix provide reference 
data to support chapters 2 to 6. For the most part these data are 
self-explanatory. 

Figure A-1 contains operating failure rates for military 
standard parts. They relate to electronic, electromechanical, 
and some mechanical parts and are useful in making approximate 
reliability predictions as discussed in chapter 3. Their 
use, limitations, and validity are explained in chapter 4. 

Figure A-2 provides failure rate information for making 
approximate reliability predictions for systems that use estab- 
lished-reliability parts, such as air- and ground-launched 
vehicles, airborne and critical ground support equipment, 
piloted aircraft, and orbiting satellites. The use of this figure is 
discussed in chapter 4. 

Figure A-3 shows the relationship of operating application 
factor to nonoperating application factor. These data can be 
used to adjust failure rates for the mission condition. The use of 
this figure is also discussed in chapter 4. 

Figure A-4 contains reliability curves for interpreting the 
results of attribute tests. They provide seven confidence levels, 
from 50 percent to 99 percent, and six test failure levels, 
from 0 to 5 failures. The use of this figure is discussed in 
chapter 5. 

Table A-1 contains values of the negative exponential function 
e^-x, where x varies from 0 to 0.1999. The tabulated data 
make it easy to look up the reliability, where the product of 
failure rate λ (or 1/MTBF) and operating time t is substituted 
for x. The use of this table is discussed in chapter 3, and it is 
frequently referred to in chapters 4 to 6. 
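The table lookup described above is equivalent to evaluating the exponential directly. The sketch below is illustrative only; the function name and the sample operating time and MTBF are hypothetical, chosen so that x = t/MTBF = 0.0100.

```python
import math


def reliability(operating_time_hr, mtbf_hr):
    """Return e^-x, where x is operating time divided by MTBF.

    This is the same quantity tabulated in table A-1 for x = t/MTBF.
    """
    x = operating_time_hr / mtbf_hr
    return math.exp(-x)


# Hypothetical case: 100 hr of operation with an MTBF of 10 000 hr, so x = 0.0100.
r = reliability(100.0, 10_000.0)
print(f"{r:.5f}")  # agrees with the table A-1 entry for x = 0.0100 (0.99005)
```

A direct calculation also covers values of x that fall between or beyond the tabulated entries.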

Table A-2 contains tolerance factors for calculating the 
results of mean-time-between-failure tests. It provides seven 
confidence levels, from 50 to 99 percent for 0 to 15 observed 
failures. The use of this table is explained in the table. Examples 
are discussed in chapter 6. 

Tables A-3 to A-5 contain tabulated data for safety margins, 
probability, sample size, and test-demonstrated safety margins 
for tests to failure. They provide three confidence levels, from 
90 to 99 percent, and sample sizes from 5 to 100. Values similar 
to these are presented on the safety margin side of the reliability 
slide rule; the slide rule provides six confidence levels and 
sample sizes from 5 to 80. The use of these tables and the slide 
rule is discussed in chapter 6. 

More information on this subject can be found in references 
A-1 and A-2. 
Figure A-2.—High-reliability catastrophic failure rates for operating mode. Failure rate for these parts in 
nonoperating mode is about a factor of 10 less than values shown. (From ref. A-1.) 
TABLE A-1.—VALUES OF NEGATIVE EXPONENTIAL FUNCTION 

   x     e^-x       x     e^-x      x     e^-x      x     e^-x      x     e^-x 

0.0000 1.00000   .0001 .99990   .0002 .99980   .0003 .99970   .0004 .99960 
0.0005 0.99950   .0006 .99940   .0007 .99930   .0008 .99920   .0009 .99910 
0.0010 0.99900   .0011 .99890   .0012 .99880   .0013 .99870   .0014 .99860 
0.0015 0.99850   .0016 .99840   .0017 .99830   .0018 .99820   .0019 .99810 
0.0020 0.99800   .0021 .99790   .0022 .99780   .0023 .99770   .0024 .99760 
0.0025 0.99750   .0026 .99740   .0027 .99730   .0028 .99720   .0029 .99710 
0.0030 0.99700   .0031 .99690   .0032 .99681   .0033 .99671   .0034 .99661 
0.0035 0.99651   .0036 .99641   .0037 .99631   .0038 .99621   .0039 .99611 
0.0040 0.99601   .0041 .99591   .0042 .99581   .0043 .99571   .0044 .99561 
0.0045 0.99551   .0046 .99541   .0047 .99531   .0048 .99521   .0049 .99511 
0.0050 0.99501   .0051 .99491   .0052 .99481   .0053 .99471   .0054 .99461 
0.0055 0.99452   .0056 .99442   .0057 .99432   .0058 .99422   .0059 .99412 
0.0060 0.99402   .0061 .99392   .0062 .99382   .0063 .99372   .0064 .99362 
0.0065 0.99352   .0066 .99342   .0067 .99332   .0068 .99322   .0069 .99312 
0.0070 0.99302   .0071 .99293   .0072 .99283   .0073 .99273   .0074 .99263 
0.0075 0.99253   .0076 .99243   .0077 .99233   .0078 .99223   .0079 .99213 
0.0080 0.99203   .0081 .99193   .0082 .99183   .0083 .99173   .0084 .99164 
0.0085 0.99154   .0086 .99144   .0087 .99134   .0088 .99124   .0089 .99114 
0.0090 0.99104   .0091 .99094   .0092 .99084   .0093 .99074   .0094 .99064 
0.0095 0.99054   .0096 .99045   .0097 .99035   .0098 .99025   .0099 .99015 
0.0100 0.99005   .0101 .98995   .0102 .98985   .0103 .98975   .0104 .98965 
0.0105 0.98955   .0106 .98946   .0107 .98936   .0108 .98926   .0109 .98916 
0.0110 0.98906   .0111 .98896   .0112 .98886   .0113 .98876   .0114 .98866 
0.0115 0.98857   .0116 .98847   .0117 .98837   .0118 .98827   .0119 .98817 
0.0120 0.98807   .0121 .98797   .0122 .98787   .0123 .98777   .0124 .98767 
0.0125 0.98757   .0126 .98747   .0127 .98738   .0128 .98728   .0129 .98718 
0.0130 0.98708   .0131 .98699   .0132 .98689   .0133 .98679   .0134 .98669 
0.0135 0.98659   .0136 .98649   .0137 .98639   .0138 .98629   .0139 .98620 
0.0140 0.98610   .0141 .98600   .0142 .98590   .0143 .98580   .0144 .98570 
0.0145 0.98560   .0146 .98551   .0147 .98541   .0148 .98531   .0149 .98521 
0.0150 0.98511   .0151 .98501   .0152 .98491   .0153 .98482   .0154 .98472 
0.0155 0.98462   .0156 .98452   .0157 .98442   .0158 .98432   .0159 .98423 
0.0160 0.98413   .0161 .98403   .0162 .98393   .0163 .98383   .0164 .98373 
0.0165 0.98364   .0166 .98354   .0167 .98344   .0168 .98334   .0169 .98324 
0.0170 0.98314   .0171 .98305   .0172 .98295   .0173 .98285   .0174 .98275 
0.0175 0.98265   .0176 .98255   .0177 .98246   .0178 .98236   .0179 .98226 
0.0180 0.98216   .0181 .98206   .0182 .98196   .0183 .98187   .0184 .98177 
0.0185 0.98167   .0186 .98157   .0187 .98147   .0188 .98138   .0189 .98128 
0.0190 0.98118   .0191 .98108   .0192 .98098   .0193 .98089   .0194 .98079 
0.0195 0.98069   .0196 .98059   .0197 .98049   .0198 .98039   .0199 .98030 
0.0200 0.98020   .0201 .98010   .0202 .98000   .0203 .97990   .0204 .97981 
0.0205 0.97971   .0206 .97961   .0207 .97951   .0208 .97941   .0209 .97932 
0.0210 0.97922   .0211 .97912   .0212 .97902   .0213 .97893   .0214 .97883 
0.0215 0.97873   .0216 .97863   .0217 .97853   .0218 .97844   .0219 .97834 
0.0220 0.97824   .0221 .97814   .0222 .97804   .0223 .97795   .0224 .97785 
0.0225 0.97775   .0226 .97765   .0227 .97756   .0228 .97746   .0229 .97736 
0.0230 0.97726   .0231 .97716   .0232 .97707   .0233 .97697   .0234 .97687 
0.0235 0.97677   .0236 .97668   .0237 .97658   .0238 .97648   .0239 .97638 
0.0240 0.97629   .0241 .97619   .0242 .97609   .0243 .97599   .0244 .97590 
0.0245 0.97580   .0246 .97570   .0247 .97560   .0248 .97550   .0249 .97541 
0.0250 0.97531   .0251 .97521   .0252 .97511   .0253 .97502   .0254 .97492 
0.0255 0.97482   .0256 .97472   .0257 .97463   .0258 .97453   .0259 .97443 
0.0260 0.97434   .0261 .97424   .0262 .97414   .0263 .97404   .0264 .97395 
0.0265 0.97385   .0266 .97375   .0267 .97365   .0268 .97356   .0269 .97346 
0.0270 0.97336   .0271 .97326   .0272 .97317   .0273 .97307   .0274 .97297 
0.0275 0.97287   .0276 .97278   .0277 .97268   .0278 .97258   .0279 .97249 
0.0280 0.97239   .0281 .97229   .0282 .97219   .0283 .97210   .0284 .97200 
0.0285 0.97190   .0286 .97181   .0287 .97171   .0288 .97161   .0289 .97151 
0.0290 0.97142   .0291 .97132   .0292 .97122   .0293 .97113   .0294 .97103 
0.0295 0.97093   .0296 .97083   .0297 .97074   .0298 .97064   .0299 .97054 
TABLE A-1.—Continued. 

   x     e^-x       x     e^-x      x     e^-x      x     e^-x      x     e^-x 

0.0300 0.97045   .0301 .97035   .0302 .97025   .0303 .97015   .0304 .97006 
0.0305 0.96996   .0306 .96986   .0307 .96977   .0308 .96967   .0309 .96957 
0.0310 0.96948   .0311 .96938   .0312 .96928   .0313 .96918   .0314 .96909 
0.0315 0.96899   .0316 .96889   .0317 .96879   .0318 .96870   .0319 .96860 
0.0320 0.96851   .0321 .96841   .0322 .96831   .0323 .96822   .0324 .96812 
0.0325 0.96802   .0326 .96793   .0327 .96783   .0328 .96773   .0329 .96764 
0.0330 0.96754   .0331 .96744   .0332 .96735   .0333 .96725   .0334 .96715 
0.0335 0.96705   .0336 .96696   .0337 .96686   .0338 .96676   .0339 .96667 
0.0340 0.96657   .0341 .96647   .0342 .96638   .0343 .96628   .0344 .96618 
0.0345 0.96609   .0346 .96599   .0347 .96590   .0348 .96580   .0349 .96570 
0.0350 0.96561   .0351 .96551   .0352 .96541   .0353 .96531   .0354 .96522 
0.0355 0.96512   .0356 .96503   .0357 .96493   .0358 .96483   .0359 .96474 
0.0360 0.96464   .0361 .96454   .0362 .96445   .0363 .96435   .0364 .96425 
0.0365 0.96416   .0366 .96406   .0367 .96397   .0368 .96387   .0369 .96377 
0.0370 0.96368   .0371 .96358   .0372 .96348   .0373 .96339   .0374 .96329 
0.0375 0.96319   .0376 .96310   .0377 .96300   .0378 .96291   .0379 .96281 
0.0380 0.96271   .0381 .96262   .0382 .96252   .0383 .96242   .0384 .96233 
0.0385 0.96223   .0386 .96214   .0387 .96204   .0388 .96194   .0389 .96185 
0.0390 0.96175   .0391 .96165   .0392 .96156   .0393 .96146   .0394 .96137 
0.0395 0.96127   .0396 .96117   .0397 .96108   .0398 .96098   .0399 .96089 
0.0400 0.96079   .0401 .96069   .0402 .96060   .0403 .96050   .0404 .96041 
0.0405 0.96031   .0406 .96021   .0407 .96012   .0408 .96002   .0409 .95993 
0.0410 0.95983   .0411 .95973   .0412 .95964   .0413 .95954   .0414 .95945 
0.0415 0.95935   .0416 .95925   .0417 .95916   .0418 .95906   .0419 .95897 
0.0420 0.95887   .0421 .95877   .0422 .95868   .0423 .95858   .0424 .95849 
0.0425 0.95839   .0426 .95829   .0427 .95820   .0428 .95810   .0429 .95801 
0.0430 0.95791   .0431 .95782   .0432 .95772   .0433 .95762   .0434 .95753 
0.0435 0.95743   .0436 .95734   .0437 .95724   .0438 .95715   .0439 .95705 
0.0440 0.95695   .0441 .95686   .0442 .95676   .0443 .95667   .0444 .95657 
0.0445 0.95648   .0446 .95638   .0447 .95628   .0448 .95619   .0449 .95609 
0.0450 0.95600   .0451 .95590   .0452 .95581   .0453 .95571   .0454 .95562 
0.0455 0.95552   .0456 .95542   .0457 .95533   .0458 .95523   .0459 .95514 
0.0460 0.95504   .0461 .95495   .0462 .95485   .0463 .95476   .0464 .95466 
0.0465 0.95456   .0466 .95447   .0467 .95437   .0468 .95428   .0469 .95418 
0.0470 0.95409   .0471 .95399   .0472 .95390   .0473 .95380   .0474 .95371 
0.0475 0.95361   .0476 .95352   .0477 .95342   .0478 .95332   .0479 .95323 
0.0480 0.95313   .0481 .95304   .0482 .95294   .0483 .95285   .0484 .95275 
0.0485 0.95266   .0486 .95256   .0487 .95247   .0488 .95237   .0489 .95228 
0.0490 0.95218   .0491 .95209   .0492 .95199   .0493 .95190   .0494 .95180 
0.0495 0.95171   .0496 .95161   .0497 .95151   .0498 .95142   .0499 .95132 
0.0500 0.95123   .0501 .95113   .0502 .95104   .0503 .95094   .0504 .95085 
0.0505 0.95075   .0506 .95066   .0507 .95056   .0508 .95047   .0509 .95037 
0.0510 0.95028   .0511 .95018   .0512 .95009   .0513 .94999   .0514 .94990 
0.0515 0.94980   .0516 .94971   .0517 .94961   .0518 .94952   .0519 .94942 
0.0520 0.94933   .0521 .94923   .0522 .94914   .0523 .94904   .0524 .94895 
0.0525 0.94885   .0526 .94876   .0527 .94866   .0528 .94857   .0529 .94847 
0.0530 0.94838   .0531 .94829   .0532 .94819   .0533 .94810   .0534 .94800 
0.0535 0.94791   .0536 .94781   .0537 .94772   .0538 .94762   .0539 .94753 
0.0540 0.94743   .0541 .94734   .0542 .94724   .0543 .94715   .0544 .94705 
0.0545 0.94696   .0546 .94686   .0547 .94677   .0548 .94667   .0549 .94658 
0.0550 0.94649   .0551 .94639   .0552 .94630   .0553 .94620   .0554 .94611 
0.0555 0.94601   .0556 .94592   .0557 .94582   .0558 .94573   .0559 .94563 
0.0560 0.94554   .0561 .94544   .0562 .94535   .0563 .94526   .0564 .94516 
0.0565 0.94507   .0566 .94497   .0567 .94488   .0568 .94478   .0569 .94469 
0.0570 0.94459   .0571 .94450   .0572 .94441   .0573 .94431   .0574 .94422 
0.0575 0.94412   .0576 .94403   .0577 .94393   .0578 .94384   .0579 .94374 
0.0580 0.94365   .0581 .94356   .0582 .94346   .0583 .94337   .0584 .94327 
0.0585 0.94318   .0586 .94308   .0587 .94299   .0588 .94289   .0589 .94280 
0.0590 0.94271   .0591 .94261   .0592 .94252   .0593 .94242   .0594 .94233 
0.0595 0.94224   .0596 .94214   .0597 .94205   .0598 .94195   .0599 .94186 
   x     e^-x       x     e^-x      x     e^-x      x     e^-x      x     e^-x 

0.0600 0.94176   .0601 .94167   .0602 .94158   .0603 .94148   .0604 .94139 
0.0605 0.94129   .0606 .94120   .0607 .94111   .0608 .94101   .0609 .94092 
0.0610 0.94082   .0611 .94073   .0612 .94064   .0613 .94054   .0614 .94045 
0.0615 0.94035   .0616 .94026   .0617 .94016   .0618 .94007   .0619 .93998 
0.0620 0.93988   .0621 .93979   .0622 .93969   .0623 .93960   .0624 .93951 
0.0625 0.93941   .0626 .93932   .0627 .93923   .0628 .93913   .0629 .93904 
0.0630 0.93894   .0631 .93885   .0632 .93876   .0633 .93866   .0634 .93857 
0.0635 0.93847   .0636 .93838   .0637 .93829   .0638 .93819   .0639 .93810 
0.0640 0.93800   .0641 .93791   .0642 .93782   .0643 .93772   .0644 .93763 
0.0645 0.93754   .0646 .93744   .0647 .93735   .0648 .93725   .0649 .93716 
0.0650 0.93707   .0651 .93697   .0652 .93688   .0653 .93679   .0654 .93669 
0.0655 0.93660   .0656 .93651   .0657 .93641   .0658 .93632   .0659 .93622 
0.0660 0.93613   .0661 .93604   .0662 .93594   .0663 .93585   .0664 .93576 
0.0665 0.93566   .0666 .93557   .0667 .93548   .0668 .93538   .0669 .93529 
0.0670 0.93520   .0671 .93510   .0672 .93501   .0673 .93491   .0674 .93482 
0.0675 0.93473   .0676 .93463   .0677 .93454   .0678 .93445   .0679 .93435 
0.0680 0.93426   .0681 .93417   .0682 .93407   .0683 .93398   .0684 .93389 
0.0685 0.93379   .0686 .93370   .0687 .93361   .0688 .93351   .0689 .93342 
0.0690 0.93333   .0691 .93323   .0692 .93314   .0693 .93305   .0694 .93295 
0.0695 0.93286   .0696 .93277   .0697 .93267   .0698 .93258   .0699 .93249 
0.0700 0.93239   .0701 .93230   .0702 .93221   .0703 .93211   .0704 .93202 
0.0705 0.93193   .0706 .93183   .0707 .93174   .0708 .93165   .0709 .93156 
0.0710 0.93146   .0711 .93137   .0712 .93128   .0713 .93118   .0714 .93109 
0.0715 0.93100   .0716 .93090   .0717 .93081   .0718 .93072   .0719 .93062 
0.0720 0.93053   .0721 .93044   .0722 .93034   .0723 .93025   .0724 .93016 
0.0725 0.93007   .0726 .92997   .0727 .92988   .0728 .92979   .0729 .92969 
0.0730 0.92960   .0731 .92951   .0732 .92941   .0733 .92932   .0734 .92923 
0.0735 0.92914   .0736 .92904   .0737 .92895   .0738 .92886   .0739 .92876 
0.0740 0.92867   .0741 .92858   .0742 .92849   .0743 .92839   .0744 .92830 
0.0745 0.92821   .0746 .92811   .0747 .92802   .0748 .92793   .0749 .92784 
0.0750 0.92774   .0751 .92765   .0752 .92756   .0753 .92747   .0754 .92737 
0.0755 0.92728   .0756 .92719   .0757 .92709   .0758 .92700   .0759 .92691 
0.0760 0.92682   .0761 .92672   .0762 .92663   .0763 .92654   .0764 .92645 
0.0765 0.92635   .0766 .92626   .0767 .92617   .0768 .92608   .0769 .92598 
0.0770 0.92589   .0771 .92580   .0772 .92570   .0773 .92561   .0774 .92552 
0.0775 0.92543   .0776 .92533   .0777 .92524   .0778 .92515   .0779 .92506 
0.0780 0.92496   .0781 .92487   .0782 .92478   .0783 .92469   .0784 .92459 
0.0785 0.92450   .0786 .92441   .0787 .92432   .0788 .92422   .0789 .92413 
0.0790 0.92404   .0791 .92395   .0792 .92386   .0793 .92376   .0794 .92367 
0.0795 0.92358   .0796 .92349   .0797 .92339   .0798 .92330   .0799 .92321 
0.0800 0.92312   .0801 .92302   .0802 .92293   .0803 .92284   .0804 .92275 
0.0805 0.92265   .0806 .92256   .0807 .92247   .0808 .92238   .0809 .92229 
0.0810 0.92219   .0811 .92210   .0812 .92201   .0813 .92191   .0814 .92182 
0.0815 0.92173   .0816 .92164   .0817 .92155   .0818 .92146   .0819 .92136 
0.0820 0.92127   .0821 .92118   .0822 .92109   .0823 .92100   .0824 .92090 
0.0825 0.92081   .0826 .92072   .0827 .92063   .0828 .92054   .0829 .92044 
0.0830 0.92035   .0831 .92026   .0832 .92017   .0833 .92008   .0834 .91998 
0.0835 0.91989   .0836 .91980   .0837 .91971   .0838 .91962   .0839 .91952 
0.0840 0.91943   .0841 .91934   .0842 .91925   .0843 .91916   .0844 .91906 
0.0845 0.91897   .0846 .91888   .0847 .91879   .0848 .91870   .0849 .91860 
0.0850 0.91851   .0851 .91842   .0852 .91833   .0853 .91824   .0854 .91814 
0.0855 0.91805   .0856 .91796   .0857 .91787   .0858 .91778   .0859 .91769 
0.0860 0.91759   .0861 .91750   .0862 .91741   .0863 .91732   .0864 .91723 
0.0865 0.91714   .0866 .91704   .0867 .91695   .0868 .91686   .0869 .91677 
0.0870 0.91668   .0871 .91659   .0872 .91649   .0873 .91640   .0874 .91631 
0.0875 0.91622   .0876 .91613   .0877 .91604   .0878 .91594   .0879 .91585 
0.0880 0.91576   .0881 .91567   .0882 .91558   .0883 .91549   .0884 .91539 
0.0885 0.91530   .0886 .91521   .0887 .91512   .0888 .91503   .0889 .91494 
0.0890 0.91485   .0891 .91475   .0892 .91466   .0893 .91457   .0894 .91448 
0.0895 0.91439   .0896 .91430   .0897 .91421   .0898 .91411   .0899 .91402 
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X 

e~ x 

0.0900 

0.91393 

.0901 

.91384 

.0902 

.91375 

.0903 

.91366 

.0904 

.91357 

0.0905 

0.91347 

.0906 

.91338 

.0907 

.91329 

.0908 

.91320 

.0909 

.91311 

0.0910 

0.91302 

.0911 

.91293 

.0912 

.91284 

.0913 

.91274 

.0914 

.91265 

0.0915 

0.91256 

.0916 

.91247 

.0917 

.91238 

.0918 

.91229 

,0919 

.91220 

0.0920 

0.92111 

.0921 

.91201 

.0922 

.91192 

.0923 

.91183 

.0924 

.91174 

0.0925 

0.91165 

.0926 

.91156 

.0927 

.91147 

.0928 

.91138 

.0929 

.91128 

0.0930 

0.91119 

.0931 

.91110 

.0932 

.91 101 

.0933 

.91092 

.0934 

.91083 

0.0935 

0.91074 

.0936 

.91065 

.0937 

.91056 

.0938 

.91046 

.0939 

.91037 

0.0940 

0.91028 

.0941 

.91019 

.0942 

.91010 

.0943 

.91001 

.0944 

.90992 

0.0945 

0.90983 

.0946 

.90974 

.0947 

.90965 

.0948 

.90955 

.0949 

.90946 


X 

e~ x 

0.0950 

0.90937 

.0951 

.90928 

.0952 

.90919 

.0953 

.90910 

.0954 

.90901 

0.0955 

0.90892 

.0956 

.90883 

.0957 

.90874 

.0958 

.90865 

.0959 

.90855 

0.0960 

0.90846 

.0961 

.90837 

.0962 

.90828 

.0963 

.90819 

.0964 

.90810 

0.0965 

0.90801 

.0966 

.90792 

.0967 

.90783 

.0968 

.90774 

.0969 

.90765 

0.0970 

0.90756 

.0971 

.90747 

.0972 

.90737 

.0973 

.90728 

.0974 

.90719 

0.0975 

0.90710 

.0976 

.90701 

.0977 

.90692 

.0978 

.90683 

.0979 

.90674 

0.0980 

0.90665 

.0981 

.90656 

.0982 

.90647 

.0983 

.90638 

.0984 

.90629 

0.0985 

0.90620 

.0986 

.90611 

.0987 

.90601 

.0988 

.90592 

.0989 

.90583 

0.0990 

0.90574 

.0991 

.90565 

.0992 

.90556 

.0993 

.90547 

.0994 

.90538 

0.0995 

0.90529 

.0996 

.90520 

.0997 

.90501 

.0998 

.90502 

.0999 

.90493 


X 


0.1000 

0.90484 

.1001 

.90475 

.1002 

.90466 

.1003 

.90457 

.1004 

.90448 

0.1005 

0.90439 

.1006 

.90429 

.1007 

.90420 

.1008 

.90411 

.1009 

.90402 

0.1010 

0.90393 

.1011 

.90384 

.1012 

.90375 

.1013 

.90366 

.1014 

.90357 

0.1015 

0.90348 

.1016 

.90339 

.1017 

.90330 

.1018 

.90321 

.1019 

.90312 

0.1020 

0.90303 

.1021 

.90294 

.1022 

.90285 

.1023 

.90276 

.1024 

.90267 

0.1025 

0.90258 

.1026 

.90249 

.1027 

.90240 

.1028 

.90231 

.1029 

.90222 

0.1030 

0.90213 

.1031 

.90204 

.1032 

.90195 

.1033 

.90186 

.1034 

.90177 

0.1035 

0.90168 

.1036 

.90159 

.1037 

.90150 

.1038 

.90141 

.1039 

.90132 

0.1040 

0.90123 

.1041 

.90114 

.1042 

.90105 

.1043 

.90095 

.1044 

.90086 

0.1045 

0.90077 

.1046 

.90068 

.1047 

.90059 

.1048 

.90050 

.1049 

.90041 


X 

e~ x 

0.1050 

0.90032 

.1051 

.90023 

.1052 

.90014 

.1053 

.90005 

.1054 

.89996 

0.1055 

0.89987 

.1056 

.89978 

.1057 

.89969 

.1058 

.89960 

.1059 

.89951 

0.1060 

0.89942 

.1061 

.89933 

.1062 

.89924 

.1063 

.89915 

.1064 

.89906 

0.1065 

0.89898 

.1066 

.89889 

.1067 

.89880 

.1068 

.89871 

.1069 

.89862 

0.1070 

0.89853 

.1071 

.89844 

.1072 

.89835 

.1073 

.89826 

.1074 

.89817 

0.1075 

0.89808 

.1076 

.89799 

.1077 

.89790 

.1078 

.89781 

.1079 

.89772 

0.1080 

0.89763 

.1081 

.89754 

.1082 

.89745 

.1083 

.89736 

.1084 

.89727 

0.1085 

0.89718 

.1086 

.89709 

.1087 

.89700 

.1088 

.89691 

.1089 

.89682 

0.1090 

0.89673 

.1091 

.89664 

.1092 

.89655 

.1093 

.89646 

.1094 

.89637 

0.1095 

0.89628 

.1096 

.89619 

.1097 

.89610 

.1098 

.89601 

.1099 

.89592 


X 

e~* 

0.1100 

0.89583 

.1101 

.89574 

.1102 

.89565 

.1103 

.89557 

.1004 

.89548 ! 

0.1105 

0.89539 

.1106 

.89530 

.1107 

.89521 

.1108 

.89512 

.1109 

.89503 

0.1110 

0.89494 

.1111 

.89485 

.1112 

.89476 

.1113 

.89467 

.1114 

.89458 

0.1115 

0.89449 

.1116 

,89440 

.1117 

.89431 

.1118 

.89422 

.1119 

.89413 

0.1120 

0.89404 

.1121 

.89395 

.1122 

.89387 

.1123 

.89378 

.1124 

.89369 

0.1125 

0.89360 

.1126 

.89351 

.1127 

.89342 

.1128 

.89333 

.1129 

.89324 

0.1130 

0.89315 

.1131 

.89306 

.1132 

.89297 

.1133 

.89288 

.1134 

.89279 

0.1135 

0.89270 

.1136 

.89261 

.1137 

.89253 

.1138 

.89244 

.1139 

.89235 

0.1140 

0.89226 

.1141 

.89217 

.1142 

.89208 

.1143 

.89199 

.1144 

.89190 

0.1145 

0.89181 

.1146 

.89172 

.1147 

.89163 

.1148 

.89154 

.1149 

.89146 


x

e^-x

0.1150 

0.89137 

.1151 

.89128 

.1152 

.89119 

.1153 

.89110 

.1154 

.89101 

0.1155 

0.89092 

.1156 

.89083 

.1157 

.89074 

.1158 

.89065 

.1159 

.89056 

0.1160 

0.89048 

.1161 

.89039 

.1162 

.89030 

.1163 

.89021 

.1164 

.89012 

0.1165 

0.89003 

.1166 

.88994 

.1167 

.88985 

.1168 

.88976 

.1169 

.88967 

0.1170 

0.88959 

.1171 

.88950 

.1172 

.88941 

.1173 

.88932 

.1174 

.88923 

0.1175 

0.88914 

.1176 

.88905 

.1177 

.88896 

.1178 

.88887 

.1179 

.88878 

0.1180 

0.88870 

.1181 

.88861 

.1182 

.88852 

.1183 

.88843 

.1184 

.88834 

0.1185 

0.88825 

.1186 

.88816 

.1187 

.88807 

.1188 

.88799 

.1189 

.88790 

0.1190 

0.88781

.1191 

.88772 

.1192 

.88763 

.1193 

.88754 

.1194 

.88745 

0.1195 

0.88736 

.1196 

.88728 

.1197 

.88719 

.1198 

.88710 

.1199 

.88701
0.1200
.1201 
.1202 
.1203 
.1204 

0.1205 
.1206 
.1207 
.1208 
.1209 

0.1210 
.1211 
.1212 
.1213 
.1214 

0.1215 

.1216 

.1217 

.1218 

.1219 

0.1220 

.1221 

.1222 

.1223 

.1224 

0.1225 

.1226 

.1227 

.1228 

.1229 

0.1230 
.1231 
.1232 
.1233 
.1234 

0.1235 
.1236 
.1237 
.1238 
.1239 

0.1240 
.1241 
.1242 
.1243 
.1244 

0.1245 
.1246 
.1247 
.1248 
.1249 


0.88692 
.88683 
.88674 
.88665 
.88657 

0.88648 
.88639 
.88630 
.88621 
.88612 

0.88603 
.88595 
.88586
.88577 
.88568 

0.88559 

.88550 

.88541 

.88533 

.88524 

0.88515 

.88506 

.88497 

.88488 

.88479 

0.88471

.88462

.88453

.88444 

.88435 

0.88426 
.88418 
.88409 
.88400 
.88391 

0.88382 
.88373 
.88364 
.88356 
.88347 

0.88338 
.88329 
.88320 
.88311 
.88303 

0.88294 
.88285 
.88276 
.88267 
.88259


0.1250 
.1251 
.1252 
.1253 
.1254 

0.1255 
.1256 
.1257 
.1258 
.1259 

0.1260
.1261
.1262
.1263
.1264

0.1265
.1266
.1267
.1268
.1269

0.1270
.1271
.1272
.1273
.1274

0.1275
.1276
.1277
.1278
.1279

0.1280
.1281
.1282
.1283
.1284

0.1285
.1286
.1287
.1288
.1289

0.1290
.1291
.1292
.1293
.1294

0.1295
.1296
.1297
.1298
.1299


0.88250 

.88241 

.88232 

.88223 

.88214 

0.88206 

.88197 

.88188 

.88179 

.88170 


0.1300 

.1301 

.1302 

.1303 

.1304 

0.1305 

.1306 

.1307 

.1308 

.1309


0.88161 0.1310

.88153 .1311

.88144 .1312

.88135 .1313

.88126 .1314


0.88117 

.88109 

.88100 

.88091 

.88082 

0.88065 

.88056 

.88047 

.88038 

0.88029 

.88021 

.88012 

.88003 

.87994 

0.87985 

.87977 

.87968 

.87959 

.87950 

0.87941 

.87933 

.87924 

.87915 

.87906 

0.87897 
.87889 
.87880 
.87871 
.87862 

0.87853 
.87845 
.87836 
.87827 
.87818 


0.1315
.1316 
.1317 
.1318 
.1319 

0.1320 
.1321 
.1322 
.1323 
.1324 

0.1325 
.1326 
.1327 
.1328 
.1329 

0.1330 
.1331 
.1332 
.1333 
.1334 

0.1335
.1336
.1337 
.1338 
.1339 

0.1340 

.1341 

.1342 

.1343 

.1344 

0.1345 

.1346 

.1347 

.1348 

.1349 


0.87810 

.87801 

.87792 

.87783 

.87774 

0.87766 

.87757 

.87748 

.87739 

.87731 

0.87722 

.87713 

.87704 

.87695 

.87687 

0.87678 

.87669 

.87660 

.87652 

.87643 

0.87634 
.87625 
.87617 
.87608 
.87599 

0.87590 
.87582 
.87573 
.87564 
.87555 

0.87547 
.87538 
.87529 
.87520 
.87511 

0.87503 
.87494 
.87485 
.87477 
.87468 

0.87459 
.87450 
.87442 
.87433 
.87424 


0.1350 
.1351 
.1352 
.1353 
.1354 

0.1355 
.1356 
.1357 
.1358 
.1359 

0.1360 
.1361 
.1362 
.1363 
.1364 

0.1365
.1366
.1367
.1368
.1369

0.1370
.1371
.1372
.1373
.1374

0.1375
.1376
.1377
.1378
.1379

0.1380
.1381
.1382
.1383
.1384

0.1385
.1386
.1387
.1388
.1389


0.1390
.1391 
.1392 
.1393 
.1394 


0.87415 

.87407 

.87398 

.87389 

.87380 


0.1395
.1396 
.1397 
.1398 
.1399 


0.87372 

.87363 

.87354 

.87345 

.87337 

0.87328 

.87319 

.87310 

.87302 

.87293

0.87284 
.87276 
.87267 
.87258 
.87249 

0.87241 
.87232 
.87223
.87214 
.87206 

0.87197 
.87188 
.87180 
.87171 
.87162

0.87153

.87145

.87136

.87127

.87119

0.87110
.87101 
.87092 
.87084 
.87075 

0.87066 

.87058 

.87049 

.87040 

.87031 


0.87023 

.87014 

.87005 

.86997 

.86988 

0.86979 

.86971 

.86962 

.86953

.86945 


x

e^-x

x


0.1400 

0.86936 

0.1450

0.86502

.1401

.86927 

.1451 

.86494 

.1402 

.86918 

.1452 

.86485 

.1403 

.86910 

.1453 

.86476 

.1404 

.86901 

.1454 

.86468 

0.1405 

0.86892 

0.1455

0.86459 

.1406 

.86884 

.1456 

.86450 

.1407 

.86875 

.1457 

.86442 

.1408 

.86866 

.1458

.86433 

.1409 

.86858 

.1459 

.86424 

0.1410 

0.86849 

0.1460 

0.86416 

.1411 

.86840 

.1461 

.86407 

.1412 

.86832 

.1462 

.86398 

.1413 

.86823 

.1463 

.86390 

.1414 

.86814 

.1464 

.86381 

0.1415 

0.86806 

0.1465 

0.86373 

.1416

.86797 

.1466 

.86364 

.1417 

.86788 

.1467 

.86355 

.1418 

.86779 

.1468 

.86347

.1419 

.86771 

.1469 

.86338 

0.1420 

0.86762 

0.1470

0.86329 

.1421 

.86753 

.1471 

.86321 

.1422 

.86745 

.1472 

.86312 

.1423 

.86736 

.1473 

.86304 

.1424 

.86727 

.1474 

.86295 

0.1425 

0.86719 

0.1475 

0.86286 

.1426 

.86710 

.1476 

.86278 

.1427 

.86701 

.1477 

.86269

.1428 

.86693 

.1478 

.86260 

.1429 

.86684 

.1479 

.86252 

0.1430

0.86675 

0.1480 

0.86243 

.1431 

.86667 

.1481 

.86234 

.1432 

.86658 

.1482 

.86226 

.1433 

.86649 

.1483 

.86217 

.1434 

.86641 

.1484 

.86209 

0.1435

0.86632

0.1485

0.86200 

.1436 

.86623 

.1486 

.86191 

.1437 

.86615 

.1487 

.86183 

.1438 

.86606 

.1488 

.86174 

.1439 

.86597 

.1489 

.86166 

0.1440

0.86589 

0.1490 

0.86157 

.1441 

.86580 

.1491 

.86148 

.1442 

.86571 

.1492 

.86140 

.1443 

.86563 

.1493 

.86131 

.1444 

.86554 

.1494 

.86122 

0.1445

0.86545 

0.1495 

0.86114 

.1446

.86537 

.1496 

.86105 

.1447

.86528 

.1497 

.86097 

.1448

.86520 

.1498 

.86088 

.1449

.86511 

.1499 

.86079
TABLE A-1.— Continued.


x

e^-x

x

e^-x

0.1500

0.86071 

0.1550 

0.85642 

.1501 

.86062 

.1551 

.85633 

.1502 

.86054 

.1552 

.85624 

.1503 

.86045 

.1553 

.85616 

.1504 

.86036 

.1554 

.85607 

0.1505

0.86028 

0.1555 

0.85599 

.1506 

.86019 

.1556 

.85590 

.1507 

.86010 

.1557 

.85582 

.1508 

.86002 

.1558 

.85573 

.1509 

.85993 

.1559 

.85564 

0.1510

0.85985 

0.1560 

0.85556 

.1511 

.85976 

.1561 

.85547 

.1512 

.85968 

.1562 

.85539 

.1513 

.85959 

.1563 

.85530 

.1514 

.85950 

.1564 

.85522 

0.1515

0.85942 

0.1565 

0.85513 

.1516 

.85933 

.1566 

.85505 

.1517 

.85925 

.1567 

.85496

.1518 

.85916 

.1568 

.85488 

.1519 

.85907 

.1569 

.85479 

0.1520 

0.85899 

0.1570 

0.85470 

.1521 

.85890 

.1571 

.85462 

.1522 

.85882 

.1572 

.85453 

.1523 

.85873 

.1573 

.85445 

.1524 

.85864 

.1574 

.85436 

0.1525 

0.85856 

0.1575 

0.85428 

.1526 

.85847 

.1576 

.85419

.1527 

.85839 

.1577 

.85411 

.1528 

.85830 

.1578 

.85402 

.1529 

.85822 

.1579 

.85394 

0.1530 

0.85813 

0.1580 

0.85385 

.1531 

.85804 

.1581 

.85376 

.1532 

.85796 

.1582 

.85368 

.1533 

.85787 

.1583 

.85359 

.1534 

.85779 

.1584 

.85351 

0.1535 

0.85770 

0.1585 

0.85342 

.1536 

.85761 

.1586 

.85334 

.1537 

.85753 

.1587 

.85325 

.1538 

.85744 

.1588 

.85317 

.1539 

.85736 

.1589 

.85308 

0.1540 

0.85727 

0.1590 

0.85300 

.1541 

.85719 

.1591 

.85291 

.1542 

.85710 

.1592 

.85283 

.1543 

.85701 

.1593 

.85274 

.1544 

.85693 

.1594 

.85266 

0.1545 

0.85684 

0.1595 

0.85257 

.1546 

.85676 

.1596 

.85248 

.1547 

.85667 

.1597 

.85240 

.1548 

.85659 

.1598 

.85231 

.1549 

.85650 

.1599 

.85223 


0.1600 

.1601 

.1602 

.1603 

.1604 

0.1605 

.1606 

.1607 

.1608 

.1609 

0.1610 

.1611 

.1612 

.1613 

.1614 

0.1615 

.1616 

.1617 

.1618 

.1619 

0.1620 

.1621 

.1622 

.1623 

.1624 

0.1625 

.1626 

.1627 

.1628 

.1629 

0.1630 

.1631 

.1632 

.1633 

.1634 

0.1635 

.1636 

.1637 

.1638 

.1639 

0.1640 

.1641 

.1642 

.1643 

.1644 

0.1645 

.1646 

.1647 

.1648 

.1649 


0.85214 

.85206 

.85197 

.85189 

.85180 

0.85172 

.85163 

.85155 

.85146 

.85138 

0.85129 

.85121 

.85112 

.85104 

.85095 

0.85087 

.85078 

.85070 

.85061 

.85053 

0.85044 

.85036 

.85027 

.85019 

.85010 

0.85002 

.84993 

.84985 

.84976 

.84968 

0.84959 

.84951 

.84942 

.84934 

.84925 

0.84917 

.84908 

.84900 

.84891 

.84883 

0.84874 

.84866 

.84857 

.84849 

.84840 

0.84832 

.84823 

.84815 

.84806 

.84798 


0.1650 

.1651 

.1652 

.1653 

.1654 

0.1655 

.1656 

.1657 

.1658 

.1659 

0.1660 

.1661 

.1662 

.1663 

.1664 

0.1665 

.1666 

.1667 

.1668 

.1669 

0.1670 

.1671 

.1672 

.1673 

.1674 

0.1675 

.1676 

.1677 

.1678 

.1679 

0.1680 
.1681 
.1682 
.1683 
.1684 

0.1685 

.1686 

.1687 

.1688 

.1689 

0.1690 

.1691 

.1692 

.1693 

.1694 

0.1695 

.1696 

.1697 

.1698 

.1699 


0.84789 0.1700 

.84781 .1701 

.84772 .1702 

.84764 .1703 

.84755 .1704 


0.84747 

.84739 

.84730 

.84722 

.84713 

0.84705 
.84696 
.84688
.84679
.84671 

0.84662 

.84654 

.84645 

.84637 

.84628 

0.84620 

.84611 

.84603 

.84595 

.84586 

0.84578 

.84569 

.84561 

.84552 

.84544 

0.84535 

.84527 

.84518 

.84510 

.84502 

0.84493 

.84485 

.84476 

.84468 

.84459 

0.84451 

.84442 

.84434 

.84426 

.84417 

0.84409 

.84400 

.84392 

.84383 

.84375 


0.1705
.1706
.1707
.1708
.1709

0.1710
.1711
.1712
.1713
.1714

0.1715
.1716
.1717
.1718
.1719

0.1720
.1721
.1722
.1723
.1724

0.1725
.1726
.1727
.1728
.1729

0.1730
.1731
.1732
.1733
.1734

0.1735
.1736
.1737
.1738
.1739

0.1740
.1741
.1742
.1743
.1744

0.1745
.1746 
.1747 
.1748 
.1749 


e^-x

x

e^-x

0.84366 

0.1750 

0.83946 

.84358 

.1751 

.83937 

.84350 

.1752 

.83929 

.84341 

.1753 

.83921 

.84333 

.1754 

.83912 

0.84324 

0.1755 

0.83904 

.84316 

.1756 

.83895 

.84307 

.1757 

.83887 

.84299 

.1758 

.83879 

.84291

.1759 

.83870 

0.84282 

0.1760 

0.83862 

.84274 

.1761 

.83853 

.84265 

.1762 

.83845 

.84257 

.1763 

.83837 

.84248 

.1764 

.83828 

0.84240 

0.1765 

0.83820 

.84231 

.1766 

.83811 

.84223 

.1767 

.83803 

.84215 

.1768 

.83795 

.84206 

.1769 

.83786 

0.84198 

0.1770 

0.83778 

.84189 

.1771 

.83770 

.84181 

.1772 

.83761 

.84173 

.1773 

.83753 

.84164 

.1774 

.83744 

0.84156 

0.1775 

0.83736 

.84147 

.1776 

.83728 

.84139 

.1777 

.83719 

.84131 

.1778 

.83711 

.84122 

.1779 

.83703 

0.84114 

0.1780 

0.83694 

.84105 

.1781 

.83686 

.84097 

.1782 

.83678 

.84089 

.1783 

.83669 

.84080 

.1784 

.83661 

0.84072 

0.1785 

0.83652 

.84063 

.1786 

.83644 

.84055 

.1787 

.83636 

.84046 

.1788 

.83627

.84038 

.1789 

.83619 

0.84030 

0.1790 

0.83611 

.84021 

.1791 

.83602 

.84013 

.1792 

.83594 

.84004 

.1793 

.83586 

.83996 

.1794 

.83577 

0.83988 

0.1795 

0.83569 

.83979 

.1796 

.83560 

.83971 

.1797 

.83552 

.83962 

.1798 

.83544 

.83954 

.1799 

.83535
TABLE A-1.— Concluded.


0.1800 
.1801 
.1802 
.1803 
.1804 

0.1805 

.1806 

.1807 

.1808 

.1809 

0.1810 

.1811 

.1812 

.1813 

.1814 

0.1815 
.1816 
.1817 
.1818 
.1819 

0.1820 

.1821 

.1822 

.1823 

.1824 

0.1825 
.1826 
.1827 
.1828 
.1829 

0.1830 
.1831 
.1832 
.1833 
.1834 


0.1835 

.1836 

.1837 

.1838 

.1839 

0.1840 

.1841 

.1842 

.1843 

.1844 

0.1845 

.1846

.1847

.1848

.1849


e^-x

0.83527 
.83519 
.83510 
.83502 
.83494 

0.83485 

.83477 

.83469 

.83460 

.83452 

0.83444 

.83435 

.83427 

.83419 

.83410 

0.83402 

.83393 

.83385 

.83377 

.83368 

0.83360 

.83352 

.83343 

.83335 

.83327 

0.83318 
.83310 
.83302 
.83293 
.83285 

0.83277 
.83268 
.83260 
.83252 
.83244 


0.1850 0.83110

.1851 .83102

.1852 .83094

.1853 .83085

.1854 .83077

0.1855 0.83069

.1856 .83061

.1857 .83052

.1858 .83044

.1859 .83036


0.1860 

.1861 

.1862 

.1863

.1864 


0.83027 

.83019 

.83011

.83002 

.82994 


0.1900 

.1901 

.1902 

.1903 

.1904 

0.1905 

.1906 

.1907 

.1908 

.1909 

0.1910 

.1911 

.1912 

.1913 

.1914 


0.1865 0.82986 0.1915 

.1866 .82978 .1916

.1867 .82969 .1917

.1868 .82961 .1918

.1869 .82953 .1919


0.1870 
.1871 
.1872 
.1873
.1874

0.1875
.1876
.1877
.1878
.1879

0.1880
.1881 
.1882 
.1883 
.1884 


0.83235 

.83227 

.83219 

.83210 

.83202 

0.83194 

.83185 

.83177 

.83169 

.83160 

0.83152 

.83144 

.83135 

.83127 

.83119 


0.1885
.1886 
.1887 
.1888 
.1889 

0.1890 

.1891 

.1892 

.1893 

.1894 

0.1895 

.1896 

.1897 

.1898 

.1899 


0.82944 
.82936 
.82928 
.82919 
.82911 

0.82903 
.82895 
.82886 
.82878 
.82870 

0.82861 

.82853 

.82845 

.82837 

.82828 

0.82820 

.82812 

.82803 

.82795 

.82787 

0.82779 

.82770 

.82762 

.82754 

.82746 

0.82737 
.82729 
.82721 
.82712 
.82704 


0.1920
.1921
.1922
.1923
.1924

0.1925
.1926
.1927
.1928
.1929

0.1930
.1931
.1932
.1933
.1934

0.1935
.1936 
.1937 
.1938 
.1939 

0.1940 

.1941 

.1942

.1943

.1944

0.1945 

.1946 

.1947 

.1948 

.1949 


0.82696 

.82688 

.82679 

.82671 

.82663 

0.82655 

.82646 

.82638 

.82630 

.82622 

0.82613 

.82605 

.82597 

.82588 

.82580 


0.1950 

.1951 

.1952 

.1953 

.1954 

0.1955 

.1956 

.1957 

.1958 

.1959 

0.1960 

.1961 

.1962 

.1963 

.1964 


0.82283 

.82275 

.82267 

.82259 

.82251 

0.82242 

.82234 

.82226 

.82218 

.82209 

0.82201 

.82193 

.82185 

.82177 

.82168 


0.82572 0.1965 0.82160 

.82564 .1966 .82152

.82555 .1967 .82144

.82547 .1968 .82135

.82539 .1969 .82127


0.82531 
.82522 
.82514 
.82506 
.82498 

0.82489 
.82481 
.82473
.82465 
.82456 

0.82448 

.82440 

.82432 

.82423 

.82415 

0.82407 

.82399 

.82391

.82382 

.82374 

0.82366 

.82358 

.82349 

.82341 

.82333 

0.82325 

.82316 

.82308 

.82300

.82292


0.1970 

.1971 

.1972 

.1973 

.1974 

0.1975 

.1976 

.1977 

.1978 

.1979 

0.1980 

.1981 

.1982 

.1983 

.1984 

0.1985 

.1986 

.1987 

.1988 

.1989 


0.82119 
.82111 
.82103 
.82094 
.82086 

0.82078 
.82070 
.82062 
.82053 
.82045 

0.82037 

.82029 

.82021 

.82012 

.82004 

0.81996 

.81988 

.81980 

.81971 

.81963 


0.1990 0.81955 

.1991 .81947

.1992 .81939

.1993 .81930

.1994 .81922 


0.1995 

.1996

.1997

.1998
.1999 


0.81914 

.81906 

.81898 

.81889 

.81881 
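
The entries of Table A-1 can be spot-checked by direct computation. The following is a minimal sketch, not part of the original report: the function name `exp_neg_table` and its parameters are illustrative assumptions, and it simply evaluates e^-x at steps of 0.0001 and rounds to five decimal places as the printed table does.

```python
import math


def exp_neg_table(start, stop, step=0.0001):
    """Return (x, e^-x) rows as strings, rounded like the printed table.

    Hypothetical helper for spot-checking Table A-1; not from the report.
    """
    rows = []
    # Work with an integer step count to avoid accumulating float error.
    n_steps = round((stop - start) / step)
    for i in range(n_steps + 1):
        x = start + i * step
        rows.append((f"{x:.4f}", f"{math.exp(-x):.5f}"))
    return rows


if __name__ == "__main__":
    # Print a short run of rows for comparison with the table.
    for x, value in exp_neg_table(0.1190, 0.1194):
        print(x, value)
```

Comparing a few generated rows with the printed ones is a quick way to catch digits that were damaged in reproduction.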
TABLE A-2.— TOLERANCE FACTORS FOR OBSERVED MTBF



*To use this table:

1. Total test hours, $T = \sum_{i=1}^{n} N_i t_i$, where $N_i$ is the $i$th unit tested, $t_i$ is the test time of $N_i$, and $n$ is the total number of units tested.

2. Enter table under number of observed failures at desired confidence level to find tolerance factor.

3. Lower confidence limit of MTBF = $T \times$ (tolerance factor).
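
The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the report's own procedure: the function names are hypothetical, each list entry is taken to be one unit's test time t_i, step 3 is read as T multiplied by the tolerance factor, and the factor used in the demo is a placeholder that must be replaced by the value read from Table A-2.

```python
def total_test_hours(unit_test_times):
    """Step 1: sum the test time t_i accumulated by each tested unit."""
    return sum(unit_test_times)


def mtbf_lower_limit(unit_test_times, tolerance_factor):
    """Step 3: lower confidence limit of MTBF = T * (tolerance factor).

    tolerance_factor is the Table A-2 entry for the observed number of
    failures at the desired confidence level (step 2, a manual lookup).
    """
    return total_test_hours(unit_test_times) * tolerance_factor


if __name__ == "__main__":
    hours = [500.0, 750.0, 250.0]  # hypothetical test times for three units
    factor = 0.25                  # placeholder only, NOT a Table A-2 value
    print(total_test_hours(hours))
    print(mtbf_lower_limit(hours, factor))
```

Keeping the table lookup as an explicit argument avoids hard-coding tolerance factors that this sketch cannot verify.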
TABLE A-3.— SAFETY MARGINS AT 99-PERCENT CONFIDENCE LEVEL


(a) Sample sizes 5 to 12 




NASA/TP— 2000-207428 






















TABLE A-3.— Continued,
(a) Concluded.
TABLE A-3.— Continued,
(b) Concluded. 


Safety margin

Probability


Sample size, N 





7.0952 

7.2752 

7.4553 

7.6356 

7.8159 

7.9964 

8.1771 

8.3578 

8.5386 

8.7195 

8.9005 

9.0816 

9.2628 

9.4440 

9.6253 

9.8067 

9.9881 


10.7145 

10.8962 

11.0779 

11.2598 

11.4416 

11.6235 

11.8054 

11.9874 

12.1694 

12.3514 

12.5335 

12.7156 

12.8978 

13.0799

13.2621 

13.4443 

13.6266 

13.8088 

13.9911 

14.1734 

14.3558 

14.5381 

14.7205 


6.8849 
7.0596 
7.2344 
7.4094 
7.5845 
7.7597 
7.9351 
8.1105 
8.2861 
8.4617 
8.6374 
8.8132 
8.9890 
9.1650 
9.3410 
9.5170 
9.6932 
9.8694
10.0456 
10.2219 
10.3983 
10.5746 
10.7511 
10.9276 
11.1041 
11.2806 
11.4572 
11.6339 
11.8105
11.9873 
12.1640 
12.3407 
12.5175 
12.6944 
12.8712 
13.0481 
13.2250 
13.4019 
13.5788 
13.7558 
13.9328 
14.1098 
14.2868 



7.5590 
7.7299 
7.9009 
8.0719 
8.2431 
8.4143 
8.5856 
8.7570 
8.9284 
9.0999 
9.2715 
9.4431 
9.6148 
9.7865 
9.9583 
10.1302 
10.3020 
10.4740 
10.6459 
10.8179 
10.9900 
11.1621
11.3342 
11.5063 
11.6785 
11.8507 
12.0230 
12.1952 
12.3675 
12.5398 
12.7122 
12.8846 
13.0569 
13.2294 
13.4018 
13.5742 
13.7467 
13.9192 


6.8859 
7.0526 
7.2194 
7.3863 
7.5533 
7.7204 
7.8877 
8.0550 
8.2223 
8.3898 
8.5573 
8.7249 
8.8925 
9.0602 
9.2280 
9.3958 
9.5637 
9.7316 
9.8996 
10.0676 
10.2356 
10.4037 
10.5718 
10.7400 
10.9082 
11.0764 
11.2447 
11.4130 
11.5813
11.7496 
11.9180 
12.0864 
12.2548 
12.4233 
12.5918 
12.7603 
12.9288 
13.0973 
13.2659 
13.4344 
13.6030 


6.4190 
6.5822 
6.7454 
6.9088 
7.0723 
7.2359 
7.3995 
7.5633 
7.7272 
7.8911 
8.0551 
8.2192 
8.3834 
8.5476 
8.7119 
8.8763 
9.0406 
9.2051 
9.3696 
9.5341 
9.6987 
9.8634 
10.0280 
10.1928 
10.3575
10.5223 
10.6871 
10.8520 
11.0168 
11.1817 
11.3467 
11.5116 
11.6766 
11.8416 
12.0067 
12.1717 
12.3368 
12.5019 
12.6671 
12.8322 
12.9974 
13.1626 
13.3278 


6.2990 

6.4591 

6.6194 

6.7798 

6.9403 

7.1010 

7.2617 

7.4225 

7.5833 

7.7443 

7.9053 

8.0664 

8.2276 

8.3888 

8.5501 

8.7114 

8.8728 

9.0343

9.1958 

9.3573 

9.5189 

9.6805 

9.8422 

10.0039 

10.1656 

10.3274 

10.4892 

10.6510 

10.8129 

10.9748 

11.1367 

11.2986 

11.4606 

11.6226 

11.7846 

11.9466 

12.1087 

12.2708 

12.4329 

12.5950 

12.7571 

12.9193 

13.0814 


6.1925 

6.3501 

6.5077 

6.6655 

6.8234 

6.9814 

7.1394 

7.2976 

7.4558 

7.6141 

7.7725 

7.9310 

8.0895 

8.2480 

8.4067 

8.5654 

8.7241 

8.8829 

9.0417 

9.2006 

9.3595 

9.5184 

9.6774 

9.8365 

9.9955 

10.1546 

10.3138 

10.4729 

10.6321 

10.7913 

10.9505 

11.1098 

11.2691 

11.4284 

11.5877 

11.7471 

11.9065 

12.0659 

12.2253 

12.3847 

12.5442 

12.7036 

12.8631 


















TABLE A-3.— Continued, 
(c) Sample sizes 21 to 28
TABLE A-3.— Continued,
(c) Concluded.
TABLE A-3.— Concluded.


(d) Concluded.

Safety   Probability                              Sample size, N
margin                      30       40       50       60       70       80       90      100

 3.8     0.9999         5.5011   5.1969   5.0064   4.8742   4.7761   4.6997   4.6381   4.5872
 3.9                    5.6417   5.3301   5.1350   4.9996   4.8991   4.8209   4.7579   4.7058
 4.0                    5.7824   5.4634   5.2636   5.1251   5.0223   4.9422   4.8777   4.8243
 4.1                    5.9232   5.5968   5.3923   5.2506   5.1454   5.0635   4.9975   4.9430
 4.2                    6.0640   5.7302   5.5211   5.3762   5.2686   5.1849   5.1174   5.0615
 4.3                    6.2049   5.8637   5.6500   5.5019   5.3919   5.3063   5.2373   5.1803
 4.4                    6.3459   5.9972   5.7789   5.6275   5.5152   5.4277   5.3573   5.2991
 4.5                    6.4870   6.1308   5.9078   5.7533   5.6385   5.5492   5.4773   5.4178
 4.6                    6.6281   6.2645   6.0368   5.8791   5.7619   5.6708   5.5974   5.5367
 4.7                    6.7693   6.3982   6.1659   6.0049   5.8853   5.7923   5.7174   5.6555
 4.8                    6.9106   6.5319   6.2949   6.1307   6.0088   5.9139   5.8375   5.7744
 4.9     1.0000         7.0518   6.6657   6.4241   6.2566   6.1323   6.0356   5.9577   5.8933
 5.0                    7.1932   6.7996   6.5532   6.3825   6.2558   6.1572   6.0778   6.0122
 5.1                    7.3346   6.9334   6.6824   6.5085   6.3794   6.2789   6.1980   6.1311
 5.2                    7.4760   7.0673   6.8116   6.6345   6.5029   6.4006   6.3182   6.2501
 5.3                    7.6175   7.2013   6.9409   6.7605   6.6266   6.5224   6.4384   6.3691
 5.4                    7.7590   7.3353   7.0702   6.8865   6.7502   6.6441   6.5587   6.4881
 5.5                    7.9005   7.4693   7.1995   7.0126   6.8738   6.7659   6.6790   6.6071
 5.6                    8.0421   7.6033   7.3288   7.1386   6.9975   6.8877   6.7993   6.7262
 5.7                    8.1837   7.7374   7.4582   7.2648   7.1212   7.0095   6.9196   6.8452
 5.8                    8.3254   7.8715   7.5876   7.3909   7.2449   7.1314   7.0399   6.9643
 5.9                    8.4671   8.0056   7.7170   7.5170   7.3687   7.2532   7.1603   7.0834
 6.0                    8.6088   8.1398   7.8464   7.6432   7.4924   7.3751   7.2806   7.2025
 6.1                    8.7505   8.2740   7.9759   7.7694   7.6162   7.4970   7.4010   7.3217
 6.2                    8.8923   8.4082   8.1054   7.8956   7.7400   7.6189   7.5214   7.4408
 6.3                    9.0341   8.5424   8.2348   8.0218   7.8638   7.7409   7.6418   7.5600
 6.4                    9.1759   8.6766   8.3644   8.1481   7.9876   7.8627   7.7622   7.6791
 6.5                    9.3177   8.8109   8.4939   8.2744   8.1114   7.9847   7.8827   7.7983
 6.6                    9.4596   8.9452   8.6234   8.4006   8.2353   8.1067   8.0031   7.9175
 6.7                    9.6015   9.0794   8.7530   8.5269   8.3592   8.2286   8.1236   8.0367
 6.8                    9.7434   9.2138   8.8826   8.6532   8.4830   8.3506   8.2440   8.1559
 6.9                    9.8853   9.3481   9.0122   8.7795   8.6069   8.4726   8.3645   8.2751
 7.0                   10.0272   9.4824   9.1418   8.9059   8.7308   8.5946   8.4850   8.3944
 7.1                   10.1692   9.6168   9.2714   9.0322   8.8547   8.7167   8.6055   8.5136
 7.2                   10.3112   9.7512   9.4010   9.1586   8.9786   8.8387   8.7260   8.6329
 7.3                   10.4532   9.8856   9.5307   9.2849   9.1026   8.9607   8.8465   8.7521
 7.4                   10.5952  10.0200   9.6603   9.4113   9.2265   9.0828   8.9670   8.8714
 7.5                   10.7372  10.1544   9.7900   9.5377   9.3505   9.2048   9.0876   8.9907
 7.6                   10.8792  10.2888   9.9197   9.6641   9.4744   9.3269   9.2081   9.1100
 7.7                   11.0213  10.4233  10.0494   9.7905   9.5984   9.4490   9.3287   9.2293
 7.8                   11.1633  10.5577  10.1791   9.9169   9.7224   9.5711   9.4492   9.3486
 7.9                   11.3054  10.6922  10.3088  10.0433   9.8464   9.6932   9.5698   9.4679
 8.0                   11.4475  10.8266  10.4385  10.1698   9.9704   9.8153   9.6904   9.5872
TABLE A-4.— SAFETY MARGINS AT 95-PERCENT CONFIDENCE LEVEL


(a) Sample sizes 5 to 12
TABLE A-4.— Continued,
(a) Concluded.


Safety margin

Probability, P_x

Sample size, N




5 

6 

7 

8 

9 


10

11

12 

3.8 

3.9 

4.0 

4.1 

4.2 

4.3 

4.4 

4.5 

4.6 

4.7 

4.8 

4.9 

5.0 

5.1 

5.2 

5.3 

5.4 

5.5 

5.6 

5.7 

5.8 

5.9 
6.0 
6.1 
6.2 

6.3 

6.4 

6.5 

6.6 

6.7 

6.8 

6.9 

7.0 

7.1 

7.2 

7.3 

7.4 

7.5 

7.6 

7.7 

7.8 

7.9 
8.0 

0.9999

1.0000

9.1606 

9.3942 

9.6280 

9.8619 

10.0960 

10.3302 

10.5645 

10.7990 

11.0336 
11.2683 
11.5031 
11.7379 
11.9729 
12.2080 
12.4431 
12.6783 
12.9136 
13.1489 
13.3843 
13.6198 
13.8553 
14.0909 
14.3265 
14.5622 
14.7979 

15.0337 
15.2695 
15.5054 
15.7413 
15.9772 
16.2132
16.4492 
16.6852 
16.9213 
17.1574 
17.3935 
17.6297 
17.8659 
18.1021 
18.3383 
18.5746 
18.8109 
19.0472 

8.0673 

8.2729 

8.4787 

8.6846 

8.8906 

9.0968 

9.3031 

9.5095 

9.7160 

9.9225 

10.1292 

10.3359 

10.5428 

10.7497 

10.9567 

11.1637 

11.3708 

11.5780 

11.7852 

11.9925 

12.1998 

12.4072 

12.6146 

12.8221 

13.0296 

13.2372 

13.4447 

13.6524 

13.8600 

14.0677 

14.2755 

14.4832 

14.6910 

14.8988 

15.1067 

15.3146 

15.5225 

15.7304 

15.9383 

16.1463 

16.3543 

16.5623 

16.7703 

7.3973 

7.5858 

7.7745 

7.9633 

8.1522 

8.3413 

8.5304 

8.7196 

8.9090 

9.0984 

9.2879

9.4775 

9.6671 

9.8568 

10.0466 

10.2364 

10.4263 

10.6163 

10.8063 

10.9963 

11.1865 

11.3766 

11.5668 

11.7570 

11.9473 

12.1376 

12.3279 

12.5183 

12.7087 

12.8992 

13.0897 

13.2802 

13.4707 

13.6612 

13.8518 

14.0424 

14.2330 

14.4237 

14.6144 

14.8050 

14.9958 

15.1865 

15.3772 

6.9412 
7.1182 
7.2954 
7.4727 
7.6501 
7.8276 
8.0052 
8.1829 
8.3606 
8.5385 
8.7164 
8.8944 
9.0725 
9.2506 
9.4288 
9.6070 
9.7853 
9.9636 
10.1420
10.3205 
10.4989 
10.6775 
10.8560 
11.0346 
11.2133 
11.3919 
11.5706 
11.7494 
11.9281 
12.1069 
12.2857 
12.4646 
12.6434 
12.8223 
13.0013 
13.1802 
13.3592 
13.5381 
13.7171 
13.8962 
14.0752 
14.2542 
14.4333 

6.6073 

6.7758 

6.9444 

7.1132 

7.2820 

7.4510 

7.6200 

7.7891 

7.9583 

8.1276 

8.2969 

8.4664 

8.6358 

8.8054 

8.9750 

9.1446 

9.3143 

9.4841 

9.6539 

9.8237 

9.9936 

10.1635 

10.3335 

10.5034 

10.6735 

10.8435 

11.0136 

11.1837 

11.3539 

11.5241 

11.6943 

11.8645 

12.0347 

12.2050 

12.3753 

12.5456 

12.7160 

12.8863 

13.0567 

13.2271 

13.3975 

13.5679 

13.7384 

6.3502 

6.5124 

6.6746 

6.8369 

6.9994 

7.1619 

7.3245 

7.4872 

7.6499 

7.8128 

7.9757 

8.1386 

8.3017 

8.4647 

8.6279 

8.7910 

8.9543 

9.1176 

9.2809 

9.4442 

9.6076 

9.7710 

9.9345 

10.0980 

10.2615 

10.4251 

10.5887 

10.7523 

10.9160 

11.0796 

11.2433 

11.4070 

11.5708 

11.7345 

11.8983 

12.0621 

12.2259 

12.3898 

12.5536 

12.7175 

12.8814 

13.0453 

13.2092 

6.1462 

6.3032 

6.4603 

6.6175 

6.7748 

6.9322 

7.0896 

7.2472 

7.4048 

7.5625 

7.7202 

7.8780 

8.0358 

8.1938 

8.3517 

8.5097 

8.6678 

8.8259 

8.9840 

9.1422 

9.3004 

9.4586 

9.6169 

9.7752 

9.9336 

10.0919 

10.2503 

10.4087 

10.5672 

10.7257 

10.8842 

11.0427 

11.2012 

11.3598 

11.5183 

11.6769 

11.8356 

11.9942 

12.1528 

12.3115 

12.4702 

12.6289 

12.7876 

5.9804 

6.1333 

6.2862

6.4393 

6.5924 

6.7456 

6.8989 

7.0523 

7.2057 

7.3592 

7.5128 

7.6664 

7.8200 

7.9738 

8.1275 

8.2813 

8.4352 

8.5891 

8.7430 

8.8970 

9.0510 

9.2050 

9.3591 

9.5132 

9.6673 

9.8215 

9.9757 

10.1299 

10.2841 

10.4384 

10.5927 

10.7470 

10.9013 

11.0556 

11.2100 

11.3644 

11.5187 

11.6732 

11.8276 

11.9820 

12.1365 

12.2909 

12.4454
TABLE A-4.— Continued,
(b) Sample sizes 13 to 20 















































TABLE A-4.— Continued,
(c) Concluded. 


Safety margin, S_M

Probability, P_x

Sample size, N



21

22

23

24 

25 

26 

27 

28 

3.8 

3.9 

4.0 

4.1 

4.2 

4.3 

4.4 
4.5

4.6 

4.7 

4.8 

4.9 

5.0 

5.1 

5.2 

5.3 

5.4 

5.5 

5.6 

5.7 

5.8 

5.9 
6.0 
6.1 
6.2 

6.3 

6.4 

6.5 

6.6 

6.7 

6.8 

6.9 

7.0 

7.1 

7.2 

7.3 

7.4 

7.5 

7.6 

7.7 

7.8 

7.9 
8.0 

0.9999

1.0000

5.2254 

5.3595 

5.4937 

5.6280 

5.7623 

5.8967 

6.0311 

6.1657 

6.3002 

6.4348 

6.5695 

6.7042 

6.8389 

6.9737 

7.1085 

7.2433 

7.3782 

7.5131 

7.6480 

7.7830 

7.9180 

8.0530 

8.1880 

8.3231 

8.4582 

8.5933 

8.7284 

8.8635 

8.9987 

9.1338 

9.2690 

9.4042 

9.5395 

9.6747 

9.8099 

9.9452 

10.0805 

10.2158 

10.3511 

10.4864 

10.6217 

10.7570 

10.8924 

5.1795 

5.3125 

5.4456 

5.5787 

5.7119 

5.8452 

5.9785 

6.1119 

6.2453 

6.3788 

6.5123 

6.6458 

6.7794 

6.9131 

7.0467 

7.1804 

7.3141 

7.4479 

7.5817 

7.7155 

7.8493 

7.9832 

8.1171 

8.2510 

8.3849 

8.5189 

8.6529 

8.7868 

8.9208 

9.0549 

9.1889 

9.3230 

9.4570 

9.5911 

9.7252 

9.8593 

9.9934 

10.1276 

10.2617 

10.3959 

10.5300 

10.6642 

10.7984 

5.1375 

5.2695 

5.4015 

5.5336 

5.6658 

5.7980 

5.9303 

6.0626 

6.1950 

6.3274 

6.4599 

6.5924 

6.7250 

6.8575 

6.9902 

7.1228 

7.2555 

7.3882 

7.5209 

7.6537 

7.7865 

7.9193 

8.0521 

8.1850 

8.3179 

8.4508 

8.5837 

8.7166 

8.8496 

8.9825 

9.1155 

9.2485 

9.3815 

9.5146 

9.6476 

9.7807 

9.9137 

10.0468 

10.1799 

10.3130 

10.4461 

10.5792 

10.7123 

5.0988 

5.2298 

5.3609 

5.4921 

5.6233 

5.7546 

5.8859 

6.0173 

6.1487 

6.2802 

6.4117 

6.5433 

6.6748 

6.8065 

6.9381 

7.0698 

7.2015 

7.3333 

7.4651 

7.5969 

7.7287 

7.8605 

7.9924 

8.1243 

8.2562 

8.3881 

8.5201 

8.6520 

8.7840 

8.9160 

9.0480 

9.1801 

9.3121 

9.4442 

9.5762 

9.7083 

9.8404 

9.9725 

10.1046 

10.2368 

10.3689 

10.5010 

10.6332 

5.0631 

5.1933 

5.3235 

5.4538 

5.5841 

5.7145 

5.8449 

5.9754 

6.1060 

6.2366 

6.3672 

6.4979 

6.6286 

6.7593 

6.8901 

7.0209 

7.1517 

7.2826 

7.4134 

7.5443 

7.6753 

7.8062 

7.9372 

8.0682 

8.1992 

8.3303 

8.4613 

8.5924 

8.7235 

8.8546 

8.9857 

9.1168 

9.2480 

9.3791 

9.5103 

9.6415 

9.7727 

9.9039 

10.0351 

10.1664 

10.2976 

10.4288 

10.5601 

5.0300 
5.1593 
5.2887 
5.4182 
5.5477 
5.6773 
5.8069 
5.9366 
6.0663 
6.1961 
6.3259 
6.4558 
6.5856 
6.7156 
6.8455 
6.9755 
7.1055 
7.2355 
7.3656 
7.4957 
7.6258 
7.7559 
7.8861 
8.0162 
8.1464 
8.2766
8.4069 
8.5371 
8.6674 
8.7976 
8.9279 
9.0582 
9.1885 
9.3189 
9.4492 
9.5796 
9.7099 
9.8403 
9.9707 
10.1011 
10.2315 
10.3619 
10.4923 

4.9992 

5.1278 

5.2564 

5.3851 

5.5139 

5.6427 

5.7716 

5.9005 

6.0295 

6.1585 

6.2875 

6.4166 

6.5457 

6.6749 

6.8041 

6.9333 

7.0625 

7.1918 

7.3211 

7.4504 

7.5797 

7.7091 

7.8385 

7.9679 

8.0973 

8.2267 

8.3562 

8.4857 

8.6152 

8.7447 

8.8742 

9.0037 

9.1333 

9.2628 

9.3924 

9.5220 

9.6516 

9.7812 

9.9108 

10.0404 

10.1700 

10.2997 

10.4293 

4.9704 

5.0983 

5.2262 

5.3542 

5.4823 

5.6104 

5.7386 

5.8668 

5.9951 

6.1234 

6.2517 

6.3801 

6.5085 

6.6369 

6.7654 

6.8939 

7.0224 

7.1510 

7.2795 

7.4081 

7.5368 

7.6654 

7.7941 

7.9228 

8.0515 

8.1802 

8.3089 

8.4377 

8.5664 

8.6952 

8.8240 

8.9528 

9.0817 

9.2105 

9.3394 

9.4682 

9.5971 

9.7260 

9.8549 

9.9838 

10.1127 

10.2416 

10.3705
TABLE A-4.— Concluded.
(d) Concluded.


Probability, P_x


Sample size, N


4.9182 
5.0448 
5.1715 
5.2982 
5.4250 
5.5519 
5.6788 
5.8057 
5.9327 
6.0597 
6.1867 
6.3138 
6.4409 
6.5681 
6.6952 
6.8224 
6.9497 
7.0769 
7.2042 
7.3315 
7.4588 
7.5862 
7.7136
7.8409 
7.9683 
8.0958 
8.2232 
8.3506 
8.4781 
8.6056 
8.7331 
8.8606 
8.9881 
9.1157 
9.2432 
9.3708 
9.4983 
9.6259 
9.7535 
9.8811 
10.0087 
10.1363 
10.2639 


4.7295 

4.8515 

4.9736 

5.0958 

5.2179 

5.3402 

5.4624 

5.5847 

5.7071 

5.8294 

5.9519 

6.0743 

6.1968 

6.3193 

6.4418 

6.5643 

6.6869 

6.8095 

6.9321 

7.0547 

7.1774 

7.3000 

7.4227 

7.5454 

7.6681 

7.7908 

7.9136 

8.0363 

8.1591 

8.2819 

8.4047 

8.5275 

8.6503 

8.7731 

8.8960 

9.0188 

9.1417 

9.2645 

9.3874 

9.5103 

9.6332 

9.7561 

9.8790 


4.6092 

4.7283 

4.8475 

4.9667 

5.0860 

5.2053 

5.3246 

5.4440 

5.5634 

5.6828 

5.8023 

5.9218

6.0413 

6.1608 

6.2804 

6.4000 

6.5196 

6.6392 

6.7588 

6.8785 

6.9982 

7.1179 

7.2376 

7.3573 

7.4770 

7.5968 

7.7165 

7.8363 

7.9561 

8.0759 

8.1957 

8.3155 

8.4354 

8.5552 

8.6750 

8.7949 

8.9147 

9.0346 

9.1545 

9.2744 

9.3943 

9.5142 

9.6341 


7.1076 

7.0101 

7.2252 

7.1262 

7.3429 

7.2423 

7.4605 

7.3584 

7.5782 

7.4745 

7.6959 

7.5906 

7.8136 

7.7068 

7.9313 

7.8229 

8.0490 

7.9390 

8.1667 

8.0552 

8.2844 

8.1714 

8.4022 

8.2875 

8.5199 

8.4037 

8.6377 

8.5199 

8.7554 

8.6361 

8.8732 

8.7523 

8.9910 

8.8685 

9.1088 

8.9847 

9.2266 

9.1009 

9.3444 

9.2171 

9.4622 

9.3334 


7.2135 

7.1597 

7.3274 

7.2727 

7.4413 

7.3858 

7.5552 

7.4989 

7.6691 

7.6120 

7.7831 

7.7251 

7.8970 

7.8382 

8.0109 

7.9513 

8.1249 

8.0644 

8.2388 

8.1776 

8.3528 

8.2907

8.4668 

8.4038 

8.5807 

8.5170 

8.6947 

8.6301 

8.8087 

8.7433 

8.9227 

8.8564 

9.0367 

8.9696 

9.1507 

9.0828
TABLE A-5.— Continued,
(c) Concluded.
TABLE A-5.— Concluded.


(d) Concluded. 


Safety Probability, 

margin, P x 


Sample size, N 




4.6447 
4.7648 
4.8849 
5.0051 
5.1253 
5.2456 
5.3659 
5.4862 
5.6066 
5.7270 
5.8474 
5.9679 
6.0883 
6.2088 
6.3294 
6.4499 
6.5705 
6.6911 
6.8117 
6.9323 
7.0529 
7.1735 
7.2942 
7.4149
7.5356 
7.6563 
7.7770 
7.8978 
8.0185 
8.1393 
8.2600 
8.3808 
8.5016 
8.6224 
8.7432 
8.8640 
8.9848 
9.1056 
9.2264 
9.3473 
9.4681 
9.5890 
9.7098 


4.5057 
4.6224 
4.7392 
4.8560 
4.9728 
5.0897 
5.2066 

5.3235 
5.4405 
5.5575 
5.6745 
5.7915 
5.9086 
6.0256 
6.1427 
6.2598 
6.3770 
6.4941 
6.6113 
6.7285 
6.8456 
6.9628 
7.0801 
7.1973
7.3145 
7.4318 
7.5490 
7.6663 
7.7836 
7.9009 
8.0182 
8.1355 
8.2528 
8.3701 
8.4875 
8.6048 
8.7222 
8.8395 
8.9569 
9.0743 
9.1916 
9.3090 
9.4264 


4.4165 

4.5310 

4.6456 

4.7603 

4.8749 

4.9896 

5.1044 

5.2191 

5.3339 

5.4487 

5.5635 

5.6784 

5.7932 

5.9081 

6.0230 

6.1379 

6.2528 

6.3678 

6.4828 

6.5977 

6.7127 

6.8277 

6.9427 

7.0577 

7.1728 

7.2878

7.4029 

7.5179 

7.6330 

7.7481 

7.8632 

7.9783 

8.0934 

8.2085 

8.3236 

8.4387 

8.5538 

8.6690 

8.7841 

8.8992 

9.0144 

9.1296

9.2447 


4.3433 
4.4664 
4.5795 
4.6926 
4.8057 
4.9189 
5.0321 
5.1453 
5.2585 
5.3718 
5.4851 
5.5984 
5.7117 
5.8250 
5.9384 
6.0518 
6.1651 
6.2785 
6.3919 
6.5054 
6.6188 
6.7322 
6.8457 
6.9592 
7.0726 
7.1861 
7.2996
7.4131 
7.5266 
7.6401 
7.7537 
7.8672 
7.9807 
8.0943 
8.2078 
8.3214 
8.4349 
8.5485 
8.6621 
8.7756 
8.8892 
9.0028 
9.1164


4.3057 

4.4177 

4.5296

4.6416 

4.7536 

4.8656 

4.9776 

5.0897 

5.2018 

5.3139 

5.4260 

5.5882 

5.6503 

5.7625 

5.8747 

5.9869 

6.0991 

6.2113 

6.3236 

6.4358 

6.5481 

6.6694 

6.7727 

6.8850 

6.9973 

7.1096

7.2219 

7.3342 

7.4466 

7.5589 

7.6712 

7.7836 

7.8960 

8.0083 

8.1207 

8.2331 

8.3455 

8.4579 

8.5702 

8.6826 

8.7950 

8.9075 

9.0199 


4.2683 
4.3793 
4.4904 
4.6014 
4.7125 
4.8237 
4.9348 
5.0460 
5.1572 

5.2684 
5.3796
5.4908 
5.6021 
5.7133 
5.8246 
5.9359 
6.0472 
6.1585 
6.2698 
6.3812 
6.4925 
6.6039 
6.7152 
6.8266 
6.9380 
7.0494 
7.1608 
7.2722 
7.3836 
7.4950 
7.6064 
7.7179 
7.8283 
7.9408 
8.0522 
8.1637 
8.2751 
8.3866 
8.4981 
8.6095 
8.7210 
8.8325 
8.9440 


4.2379 

4.3482 

4.4585 

4.5688 

4.6792 

4.7896

4.9000 

5.0104 

5.1209 

5.2314 

5.3418 

5.4523 

5.5628 

5.6734 

5.7839 

5.8945 

6.0050 

6.1156 

6.2262 

6.3368 

6.4474 

6.5580 

6.6686 

6.7792 

6.8899 

7.0005 

7.1112 

7.2218 

7.3325 

7.4432 

7.5538 

7.6645 
Appendix B

Project Manager’s Guide to Risk Management 
and Product Assurance 


Introduction 

This appendix provides project managers with practical 
information about increasing the chances for project success by 
using the tools of risk management and product assurance. The 
elements of an effective product assurance program are 
described along with the benefits of using a product-assurance- 
oriented management approach to reduce project risk. The 
information should be especially useful to new project manag- 
ers and to others concerned with specifying product assurance 
requirements or developing risk management or product assur- 
ance plans. 

This appendix is written from the perspective of the NASA 
Glenn Research Center's Office of Safety and Assurance 
Technologies (OSAT). It begins with a general discussion of 
how OSAT supports projects at Glenn, including the roles and 
responsibilities of the project assurance lead. Then follow
discussions of reliability and quality assurance (R&QA)
with respect to economics and requirements, performance-based
contracting, and risk management. Finally, it describes
frequently applied requirements from various product assur- 
ance disciplines. For project managers needing further infor- 
mation, a more comprehensive treatment of risk management 
and product assurance can be found in the references. 

Risk Management and Product Assurance 
at the NASA Glenn Research Center 

The NASA Glenn Office of Safety and Assurance Technolo- 
gies advises the various project offices on risk management, 
safety, and product-assurance-related issues. Also, consistent
with the NASA Policy Directive on safety and mission success
(ref. B-1), OSAT conducts independent assessment activities
to reduce risk. Typically, it is more actively involved in flight
projects where the risks of failure are often greater and potentially
more severe. However, risk management and product
assurance tools can be applied to ground-based projects as well.

Flight projects at Glenn normally develop risk management 
and product assurance plans to define how they will manage 
risks and address the applicable product assurance require- 
ments. For many Glenn flight projects, product assurance 
requirements are specified in the Glenn Standard Assurance 
Requirements and Guidelines for Experiments (ref. B-2).

The Office of Safety and Assurance Technologies helps 
Glenn project managers develop their risk management and 
product assurance plans and recommends ways to mitigate 
risks and meet applicable product assurance requirements. To 
this end, OSAT developed and maintains the Glenn Product 
Assurance Manual (ref. B-3), which contains numerous product
assurance instructions that give suggestions for system
safety, quality, reliability and maintainability, software, and 
materials and processes. Glenn projects often use these instruc- 
tions as is or tailor them to meet specific needs. 


Project Assurance Lead 

Role 

The project assurance lead is OSAT’s principal point of 
contact with the project and serves as an important advisor to 
the project manager. The lead provides guidance and advice
during the preparation of project, risk management, and product
assurance plans; the generation of statements of work; the
review of bidders’ proposals; and final contract negotiations.
The project assurance lead, normally shown in the project
organization chart in a staff position reporting to the project
manager, works closely with the project office to ensure that
risk management and product assurance activities are consistent
with the uniqueness of the project and are as cost effective
as possible.


Responsibilities 

The project assurance lead helps the project manager identify 
and mitigate risks and ensures that product assurance principles
are applied to the design, manufacture, test, handling, installa- 
tion, and operation of the project. The lead identifies and 
provides the product assurance technical support needed to 
ensure that applicable risk, safety, reliability, maintainability, 
quality assurance, materials and processes, and software re- 
quirements are satisfied. 


Economics of OSAT 

Classical curves in figure B-1 show the relationship of
product quality cost and operational cost to product quality. To
achieve a very small percentage of product defects (high
quality), product quality cost becomes extremely high. Conversely,
if the percentage of defects is high (poor quality),
operational cost becomes extremely high. The intersection of
the two cost curves gives the optimum goal from a cost 
viewpoint. When finalizing product assurance requirements 
for a project, the project manager should keep the optimum cost 
goal in mind. However, from an engineering perspective, there 
may be some critical items for which additional safeguards 
must be established and the need for close risk control is 
mandatory. In this situation, economics is still an important 
consideration. 
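
The cost trade-off described above can be illustrated with a small numerical sketch. The two curve shapes and the constants `a` and `b` below are assumptions chosen only for illustration; they are not taken from figure B-1. With these particular shapes, the crossover of the two curves also happens to be the point of minimum total cost, which need not hold for other curve shapes.

```python
import math

# Hypothetical cost models (assumptions for illustration only):
#   product quality cost rises steeply as the defect fraction approaches zero;
#   operational cost rises in proportion to the defect fraction.

def quality_cost(defect_fraction: float, a: float = 1.0) -> float:
    """Assumed cost of achieving a given defect fraction."""
    return a / defect_fraction

def operational_cost(defect_fraction: float, b: float = 400.0) -> float:
    """Assumed cost of operating with a given defect fraction."""
    return b * defect_fraction

def crossover_defect_fraction(a: float = 1.0, b: float = 400.0) -> float:
    """Defect fraction where the two assumed curves intersect (a/d = b*d)."""
    return math.sqrt(a / b)

d_star = crossover_defect_fraction()

# Cross-check: scan a grid of defect fractions for the lowest total cost.
grid = [i / 1000 for i in range(1, 1000)]
d_min = min(grid, key=lambda d: quality_cost(d) + operational_cost(d))

print(f"curves intersect at a defect fraction of {d_star:.3f}")
print(f"lowest total cost on the grid occurs at {d_min:.3f}")
```

A project manager would substitute cost estimates from the actual project; for critical items, the text notes that engineering safeguards can justify operating away from this cost optimum.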


Figure B-1. Relationship of product quality cost to operational
cost.


Development of OSAT Requirements 

Product assurance is a broad and diverse discipline that has 
overlapping authority with procurement, engineering, manu- 
facturing, and testing. This problem has been mitigated to some 
degree at NASA Glenn by developing and using standard 
product assurance requirements where possible and by assigning
experienced project assurance leads to assist projects in
defining OSAT requirements.

The project assurance lead typically has an extensive OSAT 
background and can apply skills, training, and project experi- 
ence to tailor product assurance requirements to be reasonable 
in scope and easily understood. In addition, the project assur- 
ance lead is responsible for assuring that the product assurance 
program is consistent with project objectives and that it can 
satisfy mission requirements. 

To illustrate how product assurance requirements can be 
tailored, table B-1 lists the actual requirements imposed on 10
Glenn contracts and identifies the particular project phase 
associated with each contract. 


Effect of Performance-Based Contracting 

Even though the government has moved to performance-based
contracting, a disciplined, organized approach to product
assurance is still essential to minimize safety risks and to
maximize chances for mission success. Although the government
seeks to avoid imposing “how to” requirements on
performance-based contractors, these contractors still should follow
good product assurance practices. To verify that they do so, the
government develops and implements surveillance plans to
obtain information about performance. This verification is 
accomplished primarily through “insight” rather than through 
the more traditional “oversight.” (Insight relies on reviewing 
contractor-generated data and minimizes the amount of direct 
government involvement; conversely, oversight is more intru- 
sive because it normally involves direct government monitor- 
ing of contractor processes and activities.) 


Risk Management and Product Assurance 
Plans 

NASA programs and projects are required to use risk management
as an integral part of their management process (ref.
B-4). This requirement includes developing and implementing
a risk management plan to identify, analyze, mitigate, track,
and control program and/or project risks as part of a continuous
risk management process (ref. B-5).
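
The identify, analyze, and track steps can be sketched as a small risk register. The 1-to-5 likelihood and consequence scales, the product score, and the example entries are assumptions for illustration; the text does not prescribe a scoring scheme.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    """One entry in a hypothetical project risk register."""
    title: str
    likelihood: int    # assumed scale: 1 (remote) to 5 (near certain)
    consequence: int   # assumed scale: 1 (minimal) to 5 (catastrophic)
    mitigation: str = ""
    status: str = "open"

    @property
    def score(self) -> int:
        # Assumed ranking metric: likelihood multiplied by consequence.
        return self.likelihood * self.consequence

def rank_open_risks(risks):
    """Return open risks ordered from highest to lowest score."""
    open_risks = [r for r in risks if r.status == "open"]
    return sorted(open_risks, key=lambda r: r.score, reverse=True)

# Illustrative entries only.
register = [
    Risk("Late delivery of flight hardware", 3, 4, "Order long-lead parts early"),
    Risk("Noncompliant material in payload", 2, 5, "Review material usage list"),
    Risk("Test facility unavailable", 2, 2, status="closed"),
]

for risk in rank_open_risks(register):
    print(risk.score, risk.title, "|", risk.mitigation)
```

Reviewing the ranked list at regular intervals, and updating likelihood, consequence, and status as mitigations take effect, corresponds to the track and control steps.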
TABLE B-1. RELIABILITY AND QUALITY ASSURANCE REQUIREMENTS IMPOSED
ON VARIOUS PROGRAM TYPES

Column headings: Requirement | Aeronautics (Study, Advanced technology, Development, Flight) | Space (Development, Flight) | Energy (Development, Operational)

Reliability program plan 





P 

S 



Reliability program control 
Reliability program 






s 



reporting 






S 



Reliability training 






s 



Supplier control 
Reliability of Government- 






S 



furnished property 
Design specifications 
Reliability prediction 
Failure mode and effects 




G 

P 

S 



analysis 

Maintainability and human- 

L 





s 



induced failures 
Design reviews 



o 

G 

R,G 


S 



Failure reporting and cor- 
rective action 
Standardization of design 





S 




practices 





P 



W 

Parts program 
Reliability evaluation plan 





P 

s 



Testing 






s 



Reliability assessment 
Reliability inputs to 






s 



readiness review 
Reliability evaluation 






s 



program reviews 






s 



Quality status reporting 



n 

R 


s 


w 

Government audits; quality 









program audits 
Quality program plan 
Technical documents; quality 


M 

Q 

R 



C 

w 

support/design reviews 


o 

R,G 





Change control 



V 

o 

R,G 


s 



Identification control 



V 


s 



Data retrieval 
Source selection 


M 

Q 

n 

R,G 

R,G 

R 



C 

C 

w 

Procurement documents 



W 

n 




w 

Quality assurance at source 
Receiving inspection 
Receiving inspection records 


M 

M 

V 

Q 

Q 

R,G 

R»G 


s 

s 

s 



Supplier rating system 






s 



Postaward surveys 
Coordinate supplier inspec- 






s 



tion and tests 
Nonconformance informa- 






s 



tion feedback 



Q 

R,G 





Fabrication operations 


M 

A 

R 


s 

C 

w 

Article and material control 


0 



C 

w 

Cleanliness control 
Process control 
Workmanship standards 


M 

Q 

R,G 



C 

C 

w 
TABLE B-1. Concluded.

Column headings: Requirement | Aeronautics (Study, Advanced technology, Development, Flight) | Space (Development, Flight) | Energy (Development, Operational)

Requirement:

Inspection and test planning
Inspection records; inspection and test performance
Contractor quality control actions
Nonconformance control
Nonconformance documentation
Failure analysis and corrective action
Material review
Material review board
Contracting officer approval
Supplier material review board
Inspection of test equipment and standards
Evaluation of standards and test equipment
Measurement accuracy
Calibration accuracy
Calibration control
Environmental requirements
Remedial and preventive action (calibration)
Stamp control system
Stamp restriction
Handling and storage
Preserving, marking, packaging, and packing
Shipping
Sampling plans
Statistical planning and analysis
Contractor’s responsibility for Government property
Unsuitable Government property


M 


M 

M 

M 


M 


M 

M 


Q 

Q 


Q 

Q 


Q 

Q 


Q 

Q 


Q 

Q 


R 

R.G 


R.G 

R 

R,G 

R 

R 


R.G 

R 

R 

R,G 

R 

R 

G 

R 

R.G 


S 

S 

S 

S 


S 

S 

s 

s 

s 


w 


c 

C 


w 


w 


w 

w 

w.c 


w 

w 


At NASA Glenn, OSAT serves as a risk management con- 
sultant to the project manager by offering OSAT risk manage- 
ment training, helping to prepare risk management and/or 
product assurance plans, conducting risk assessments, helping 
to track risks, and providing other valuable support to facilitate 
the risk management process. 

An effective product assurance program is an essential ingre- 
dient for successfully managing risks. It provides the frame- 
work and discipline needed to support a structured risk 
management approach, a characteristic of many successful 
projects. The project manager can rely on an effective product
assurance program to help mitigate risks in many key areas and
thereby serve as an important risk management tool.


Development and Implementation of 
Product Assurance Plans 

As part of an overall risk reduction strategy, Glenn projects
and contractors develop and implement product assurance
plans to define and perform the tasks necessary to satisfy
applicable product assurance requirements. The plans are
intended to establish a disciplined, organized approach to
product assurance, thereby minimizing safety risks and maximizing
the chances for mission or project success.

The product assurance plan normally includes a description 
of assurance activities in the areas or disciplines discussed next. 


Assurance Reviews 

Assurance reviews help to ensure that the engineering devel- 
opment and documentation have sufficiently progressed and 
that the design and hardware are sufficiently mature to justify 
moving to the next phase of the project. These reviews ulti- 
mately require the project to demonstrate that the components, 
subsystems, and system can successfully perform their in- 
tended function under flightlike operating and environmental 
conditions. 


Verification Plan 

As part of its product assurance effort, the project develops 
a verification plan to describe the tests, analyses, and inspec- 
tions to be conducted to demonstrate hardware and/or software 
functionality and ability to safely survive expected environ- 
mental extremes. The purpose of the verification program is to 
ensure that the payload and/or experiment meets all specified 
mission requirements. This activity includes verifying that the 
design complies with the requirements and that the hardware/ 
software complies with the mission. 

Verification testing includes functional and environmental 
tests to demonstrate the ability to meet performance require- 
ments. Environmental tests consist of thermal cycling, random 
vibration, and electromagnetic interference (EMI). Note that 
environmental stress screening is an effective product assur- 
ance tool that project managers can use to verify the adequacy 
of system design and workmanship. 


System Safety 

System safety is a critical element in the product assurance 
plan. Each project must develop and implement a comprehen- 
sive system safety program to ensure project compliance with 
all applicable safety requirements, both flight and ground. 
Potential safety hazards must be identified and controlled to 
reduce the risk of injuring personnel or damaging equipment. 

The Office of Safety and Assurance Technologies provides
direct safety support or consultation to guide projects through
the NASA safety review process (refs. B-6 to B-11); it helps
projects determine the best design solution to meet specific
safety requirements, conducts hazard analyses, generates
hazard reports, develops safety compliance data packages,
supports safety reviews, and resolves safety issues with integration
centers or payload safety review panels.

Materials and Processes 

To assure safety and promote mission success, projects must 
exercise care in the selection, processing, inspection, and 
testing of materials. Prudent project managers invoke a com- 
prehensive materials and processes (M&P) program to ensure 
that materials meet applicable requirements for flammability, 
toxic off-gassing, vacuum out-gassing, corrosion, fluid com- 
patibility, and shelf-life control. This program and the associ- 
ated M&P assurance activities are documented in the product 
assurance plan. 

Projects prepare material identification and usage lists
(MIUL’s) and attempt to use compliant materials to the maximum
extent possible. Regarding materials usage, projects work
with and seek the advice of OSAT in several ways: justifying
the use of a noncompliant material for a particular application
and selecting it for that application; preparing material
usage agreements (MUA’s) that contain the rationale for
using any noncompliant materials; assuring that fabrication
and other manufacturing processes are performed in accordance
with accepted practices or approved procedures; and issuing
a materials certification letter, in concert with the
applicable NASA Materials and Processes Inter-Center Agreement,
when the materials and processes used by the project are
shown to be acceptable.

Some applications require the certification of metallic and 
nonmetallic materials to assure that the chemical and physical 
properties of the materials are compatible with the design 
requirements. After materials are selected by the engineer and 
are precisely defined by a specification (Federal, Society of 
Automotive Engineers, American Society for Testing and 
Materials, or other available standards), the purchase order for 
steels, aluminum alloys, brass, welding rods, solder, metal 
coatings, gases, and potting compounds should require that a 
test report, a certificate of conformance (fig. B-2), or both 
accompany the vendor’s shipment. In addition to the vendor’s 
certificate, it may be necessary to conduct periodic in-house
tests of metallic and nonmetallic materials to assure their 
continued conformance. 


Quality Assurance 

Quality assurance (QA), another critical element of an effec- 
tive product assurance program, is documented in the product 
assurance plan and helps a project establish and satisfy quality 
requirements through all phases of the project life cycle. 
Quality assurance (1) promotes discipline, encouraging projects
to design in quality and ensure good workmanship by using
proper controls during design, fabrication, assembly, and test;
(2) ensures that hardware and software conform to design
requirements and that documentation accurately reflects those
requirements; and (3) ensures that flight hardware is maintained
in a sufficiently clean environment to prevent exposure
to any contaminants that could degrade performance and possibly
compromise the achievement of mission objectives.

[Figure: vendor laboratory report of chemical analysis and mechanical tests (Cast Technology Incorporated, Job 1365), addressed to the NASA Lewis Research Center, Cleveland, Ohio, with a notarized statement certifying that the data are a true copy of the laboratory test results.]

Figure B-2. Typical material certification.


OSAT assists projects in developing effective quality man- 
agement systems to address areas such as configuration control, 
procurement, fabrication, inspection, electrostatic discharge 
control, and nonconformance control. It also performs quality 
audits of fabrication sources, establishes inspection require- 
ments, provides inspection and/or test monitoring services, 
makes dispositions for nonconforming material, and ensures
that facilities maintain proper environmental controls. Project
managers should be familiar with the good QA practices cited 
in the following sections. 

Review of Drawings 

Before releasing the engineering drawings to the manufac- 
turer, design engineers may avail themselves of the technical 
services provided by quality engineers when developing speci- 
fication callouts in the note section of the drawings (fig. B-3). 
Give precise information on materials, surface finish, processing,
nondestructive testing, cleanliness, identification, and packaging.
Special instructions and notes are important in obtaining a
quality product.

[Figure: part drawing with a notes section. The legible notes call out the material specification (aluminum alloy plate and sheet per Federal specification), ultrasonic inspection before any fabrication processes, surface finish, thread standard, fluorescent penetrant inspection with no cracks allowed, identification marking by an electrochemical etching process, a cleaning and handling procedure before final assembly, application of a contamination barrier, and radiographic inspection.]

Figure B-3. Typical drawing specifications.
Use of a Process Plan

Identify in a plan (fig. B-5) the manufacturing operations
that must be performed in a particular sequence. The most
commonly used processes are machining, mechanical fastening,
grinding, brazing, welding, soldering, polishing, coating,
plating, radiography, ultrasonics, fluorescent penetrant inspection,
magnetic particle inspection, painting, bonding, heat
treating, identification marking, and safety wiring.



Figure B-4.— Typical engineering change order. 


Changes in Engineering Documents 

Early in the design phase, establish a system to control 
changes (fig. B-4) in engineering documents and to remove
obsolete documents. Changes in released drawings, specifica- 
tions, test procedures, and related documents can be critical, 
particularly during the building and testing phases. For this 
reason, process the latest engineering data early to expedite 
their distribution to the participating line organizations. 
Figure B-5. Typical process plan.


Calibration of Measuring Devices 

Calibrate instruments when physical quantities are to be 
measured with any degree of accuracy. Calibration includes 
repair, periodic (recall) maintenance, and determination of the 
accuracy (adjustments made as required) of the measuring 
devices as compared with known standards from the National 
Institute of Standards and Technology. Figure B-6 shows a 
typical certificate of calibration. 


Inspection of Hardware Items 

Quality control inspectors check in-process items against 
acceptable quality standards and engineering documents (fig. 
B-7). Minor deviations from good quality practices are nor- 
mally resolved at the worksite; otherwise they are brought to 
the attention of the inspection supervisor. If the quality standard 
being violated is not contained in an engineering document, the 
supervisor may review the inspector’s decision if risks are
involved. If the discrepancy is a characteristic defined by an
engineering document, the final decision is made by material
review engineering and product assurance representatives or
the material review board.

Nonconformance of Hardware 

When hardware is to be built, some provision must be made 
for the orderly review and disposition of all items that are 
determined by inspection or test as not conforming to the 
drawing, specification, or workmanship requirements. The 
system most frequently used comprises two procedures: 

(1) An engineer or a product assurance representative is 
authorized to review and decide whether hardware can be 
reworked into a conforming condition without an engineering 
change, an instruction, or both. 

(2) The material review board reviews hardware that cannot 
be reworked to meet the engineering specifications. The board 
consists of engineering, product assurance, and when required, 
government representatives. In difficult situations, the board 
members consult with other organizations and persons to arrive 
at the minimum-risk decision. 
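
The two-procedure system above amounts to a simple routing decision. A minimal sketch follows; the names `Disposition` and `route_nonconforming_item` are hypothetical and are used only to illustrate the decision.

```python
from enum import Enum

class Disposition(Enum):
    """Hypothetical labels for the two review paths described in the text."""
    REWORK = "rework to a conforming condition under engineer or product assurance authority"
    MATERIAL_REVIEW_BOARD = "refer to the material review board"

def route_nonconforming_item(can_be_reworked_to_conform: bool) -> Disposition:
    """Choose the review path for an item that failed inspection or test.

    can_be_reworked_to_conform is the judgment made by the authorized engineer
    or product assurance representative under procedure (1).
    """
    if can_be_reworked_to_conform:
        return Disposition.REWORK
    # Procedure (2): hardware that cannot be reworked to meet the engineering
    # specifications goes to the material review board.
    return Disposition.MATERIAL_REVIEW_BOARD

print(route_nonconforming_item(True).value)
print(route_nonconforming_item(False).value)
```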

[Figure: certificate of calibration issued by Western Automatic Test Services to Litton Industries, dated 21 July 1988. It lists four calibrated items (NASA input system, serial nos. 1 and 2; NASA output system, serial nos. 1 and 2) and includes a statement on the traceability of the calibration measurements.]

Figure B-6. Typical certificate of calibration.




NASA/TP— 2000-207428 


[Figure: body section assembly data sheet with dated technician, engineer, and quality control sign-offs for each step. The legible steps are deburring and inspection of all parts under a scope; layout of circuit parts on the circuit layout sheet; cold test data (return loss and insertion loss); cleaning of parts; inspection of parts before stacking; stacking of circuit parts on the brazing fixture; measurement of circuit heights before braze; quality control verification of orientation and brazing fixture numbers; measurement of circuit heights after braze and recording of the differences; mandrel check of the beam hole; perpendicularity verification; leak check; final cold test data; and disposition of the assembly (use or reject).]

Figure B-7. Typical mandatory quality control inspection points.
Documentation of Equipment Discrepancies

In a design, certain characteristics are distinct, describable, 
and measurable in engineering units. Critical characteristics are 
generally identified by engineering documents and are closely 
controlled by quality assurance personnel. Whenever a design 
characteristic is determined to be nonconforming to released 
engineering requirements, one of the following reporting pro- 
cedures must be followed: 

(1) A minor discrepancy is recorded in a discrepancy log 
(fig. B-8). A disposition must be made by an engineer, an 
inspector, or both if the condition is a minor discrepancy (e.g., 
a scratch on a metal surface or excess material) that does not 
adversely affect form, fit, or function and the hardware can be 
used as is or reworked to engineering requirements. 

(2) A failure discrepancy report is written and a disposition 
is obtained through the engineering review board (ERB) if a 
mechanical, electrical, or electronic system or subsystem has 
failed to perform within the limits of a critical characteristic 
identified by an engineering drawing, specification, test proce- 
dure, or related engineering document. 
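
The choice between the two reporting procedures above can be sketched as follows. The field names and the fallback branch for conditions that match neither procedure are assumptions for illustration; the text defines only the two cases.

```python
from dataclasses import dataclass

@dataclass
class Discrepancy:
    """Hypothetical record of a nonconforming design characteristic."""
    description: str
    affects_form_fit_function: bool
    failed_critical_characteristic: bool

def reporting_procedure(d: Discrepancy) -> str:
    """Select the reporting procedure for a nonconforming characteristic."""
    if d.failed_critical_characteristic:
        # Procedure (2): failure discrepancy report, engineering review board.
        return "failure discrepancy report; disposition through the engineering review board"
    if not d.affects_form_fit_function:
        # Procedure (1): minor discrepancy recorded in the discrepancy log.
        return "discrepancy log; disposition by an engineer, an inspector, or both"
    # Assumption: anything else is escalated rather than guessed at.
    return "unclassified; escalate for engineering review"

scratch = Discrepancy("scratch on a metal surface", False, False)
failure = Discrepancy("output below a critical limit", True, True)
print(reporting_procedure(scratch))
print(reporting_procedure(failure))
```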


Quality Assurance Documentation of Production, Inspec- 
tion, and Test Operations 

Manufacturing, inspecting, testing, and related operations 
for major assemblies and subassemblies should be documented 
for several reasons. Such documentation can provide a status 
record of the work in progress and the work completed. Also, 
it can become a part of the permanent record of production, 
inspection, and test operations. The sophistication of the format 
and the entries in the log can be adjusted to suit the type of 
contract — research, development, or production. The chrono- 
logical entries in the log can be summarized and included in an 
acceptance data package, which contains information helpful 
to review during a contractor’s acceptance of a supplier’s 
equipment or during final Government acceptance of a contract 
end item. Figure B-9 shows a checklist used to determine if an
item conforms to specifications.



Figure B-8. — Typical discrepancy log. 




NASA/TP— 2000-207428 




2.0 Quality assurance checklist for conformance to specifications of
Communications Technology Satellite (CTS) output stage tube (OST)

OST S/N: 2021    Classification: QTM-2(QF-2)

2.1 Overall efficiency
    Specification: 50 percent minimum over CTS band of
                   12.038 to 12.123 GHz, at saturation
    Actual:        40.7 percent minimum at 12.040 GHz.
                   Out of specification. (Waiver required.)

2.2 Center frequency
    Specification: 12.0805 GHz
    Actual:        12.0805 GHz

2.3 RF power output
    Specification: 200 W minimum at saturation over CTS band of
                   12.038 to 12.123 GHz
    Actual:        170 W minimum at 12.040 GHz.
                   Out of specification. (Waiver required.)

2.4 Small signal bandwidth
    Specification: 3 dB maximum peak to peak measured at 10 dB below
                   peak saturation over the CTS band,
                   12.038 to 12.123 GHz
    Actual:        2.4 dB maximum peak to peak

Figure B-9.— Checklist for item conformance to specifications. 


Safety and Mission Assurance for Suppliers of Materials 
and Services 

Materials and services acquired by the user from outside
sources must satisfy contract, Government, or company
reliability and quality assurance requirements. The user's system
of control should involve

(1) Selecting acceptable or qualified sources 

(2) Performing surveys and audits of the supplier's facilities

(3) Inspecting the received supplier’s products 

(4) Reporting and taking corrective action for problems that 
occur 


Reliability and Maintainability 

An effective reliability and maintainability program (R&M)
can ensure that a project's hardware and software meet mission
design life and availability requirements. The R&M program is
documented in the project's product assurance plan and includes
tests, analyses, and other assurance activities to demonstrate
that the project can meet the reliability and availability goals
established. The program may also include maintainability
analyses or demonstrations to show that equipment can be
adequately maintained based on expected component failure
rates.

Several ways that OSAT assists and works with projects to 
ensure that hardware and software meet R&M requirements are 
by conducting failure mode, effects, and criticality analyses 
(see the next section); developing reliability models; making 
reliability predictions; conducting reliability trade studies; pro- 
viding component selection and control design guidelines; 
conducting analyses to identify the root causes of failures; 
implementing design changes to improve reliability and main- 
tainability; developing maintenance concepts; performing spare 
parts analyses; and developing plans (e.g., preventative main- 
tenance) to address maintainability requirements. 

The fundamental objective of a failure mode, effects, and 
criticality analysis is to identify the critical failure areas in a 
design. To accomplish this identification, each functional com- 
ponent (or higher level if adequate to attain the intended 
purpose) is sequentially assumed to fail, and the broad effects 
of each such failure on the operation of the system (fig. B-10) 
are traced. More details on this subject are available in the 
LeR-W05 10.060 ISO Work Instruction. 
Solar Array Failure Mode and Effects Analysis of Mounting and
Mechanical Deployment Assembly for Space Electric Rocket Test II

Component: Actuator assembly
    Failure mode: Binding; operation is erratic
    Cause:        Needle valve plugged; tolerance buildup;
                  O-ring damage; workmanship
    Effect:       Degraded deployment; partial deployment
    Criticality:  Minor; major
    Action:       Spring stiffness adequacy and tolerances reviewed;
                  tests carefully evaluated; workmanship inspected
    Status:       Completed; specified

    Failure mode: Actuation stops
    Cause:        Spring failure
    Effect:       No deployment
    Criticality:  Critical
    Action:       Data packages will be prepared
    Status:       Planned

Component: Linkage (mechanism assembly)
    Failure mode: Motion stops prematurely; binding and lockup
    Cause:        Design weakness; poor workmanship; damage
    Effect:       Partial deployment; slow deployment
    Criticality:  Major; minor
    Action:       Kinematics study disclosed source of binding;
                  redesigned; confidence tests will verify
                  elimination of failure mode
    Status:       Completed; planned

Component: Pin-puller assembly
    Failure mode: Tie-rod is not released
    Cause:        Excessive load; squib failure; corrosion of
                  pin puller; jamming of catch
    Effect:       Solar array does not deploy
    Criticality:  Critical
    Action:       Need study to develop alternative design with
                  adequate redundancy

Component: Mechanical assembly
    Failure mode: Attachment point of solar arrays to Agena bends
                  or breaks
    Cause:        Excessive loads
    Effect:       Partial deployment
    Criticality:  Major
    Action:       Cold gas attitude control system to be programmed;
                  low mode to avoid excessive load
    Status:       Planned

Component: Hinges
    Failure mode: Bind; spring
    Cause:        Workmanship; tolerance stackup
    Effect:       Slow deployment
    Criticality:  Minor
    Action:       Confidence tests; tolerances reviewed
    Status:       Planned; completed

Figure B-10 — Typical failure mode and effects analysis. 


EEE Parts Control 

The electronic, electrical, and electromechanical (EEE) parts 
used by a project can have a major impact on its safety and 
reliability. The project must be sure that the EEE parts selected 
and used are appropriate for their application and offer the 
lowest safety risk and greatest chance for mission success based 
on cost and schedule constraints. Projects must plan and imple- 
ment an EEE parts control program consistent with reliability 
requirements and good engineering practice. 

The OSAT helps projects select parts and develop EEE parts
identification lists. Also, it verifies that parts selected comply 
with de-rating guidelines and other requirements (e.g., radia- 
tion); conducts Alert searches in conjunction with the Govern- 
ment Industry Data Exchange Program and NASA Parts 
Advisories to identify and deal with potentially unreliable 
parts; and assists with parts screening, ensuring traceability and 
analyzing part failures (see the following sections). 


Selection and Screening 

The costs incurred during subsystem and system testing are 
inversely proportional to the money spent for examining and 
testing parts. Success is directly related to the part screening 
costs. For example, the exceptional operational life of the 
Space Electric Rocket Test II satellite is no doubt attributable 
to the extensive parts selection and screening program. 

Other factors influence parts selection and screening: the 
criticality of the hardware application, unusual environments, 
contractor experience, and in-house resources. The selection 
can range from a high-reliability part (identified in a Govern- 
ment- or industry-preferred parts handbook) to an off-the-shelf 
commercial part. Screening is a selective process as called out 
in the source control document. 
Flowchart steps, in order:

(1) Failure occurs
(2) Initiate report within 24 hr
(3) Assign number and open file (control)
(4) Analyze failure (cognizant engineer)
(5) Implement corrective action (project team)
(6) Test or verify corrective action (test engineer or technician
    and quality inspector)
(7) Concur (project manager and Office of Mission Safety and
    Assurance product assurance manager)
(8) Closeout file (control)

Side boxes in the flowchart:

Distribution (shown at two points in the flowchart):
- Project manager
- Office of Mission Safety and Assurance
- Design engineer

Take corrective action (design engineer or working group)

Cognizant engineer for
- Safety or materials and processes
- Software or electrical, electronic, and electromechanical parts
- Reliability and quality control

Working group:
- Design engineer
- Safety or electrical, electronic, and electromechanical parts;
  materials and processes; quality inspector engineer

Required corrective action:
- Design, material, or process changes
- Reworking, repair, or replacement

Figure B-11.—Failure report, analysis, and corrective action flowchart.

Materials Identification 

Good engineering practice identifies parts, components, and 
materials with a part number, a screening serial number, a date 
code, and the manufacturer. Furthermore, the marking on parts 
and components should be affixed in a location that is easily 
seen when the item is installed in an assembly. The identifica- 
tion method and location on the item are included on a drawing, 
a specification, or other associated engineering document 
(fig. B-3, note 6). During the period of fabrication, assembly,
and testing, the system of marking and recordkeeping should
provide a way to trace backward from an end item to the part
or material level.


Failure Analysis 

Some failed parts are analyzed and investigated to determine
the cause of the failure (fig. B-11). Corrective action is taken
to assure that the problem does not recur and then the action
is verified by testing. The problem is closed by ERB review.
Sometimes corrective action may change a component
application criterion, improve a packaging technique, or revise
a test procedure. Often the detailed physical and chemical
examination reveals that a refinement is needed in the materials
used during the manufacturing of a part or that an improvement
in the parts screening process is necessary.

[Figure: cumulative cost plotted against progress through steps.]


Software Product Assurance 

Software is generally a critical element in the safety and 
success of a project. Project managers are therefore wise to 
establish an effective software assurance program to ensure the 
safety, reliability, and quality of their software systems. Such 
a program includes a software assurance plan (typically part of 
the product assurance plan) to address software quality stan- 
dards, configuration management, testing, problem reporting, 
performance verification, certification process, and mission 
simulation. 

The software product assurance (SPA) effort is intended to 
ensure that all software hazards be identified and controlled, 
that the software be capable of meeting mission availability and 
design life requirements, that the software meet all perfor- 
mance requirements for the mission simulation, and that soft- 
ware documentation accurately reflect those requirements 
(fig. B-12).

The OSAT can help projects develop and implement an 
effective SPA process. For example, it can prepare SPA plans 
and conduct software hazard analyses, failure tolerance analy- 
ses, and audits. It ensures that projects follow proper software 
configuration management practices. In addition, it witnesses 
or monitors software tests and verifies that results conform to 
expectations. 


Conclusion 

Project managers can realize many benefits by using risk
management tools and a product-assurance-oriented approach
to their projects. By applying effective product assurance
techniques throughout the project life cycle, projects can achieve
the highest level of safety, quality, and reliability for the
available resources. The investment that project managers
make to apply risk management and product assurance to their
projects offers the probable return of increased mission safety
and a greater probability of success. Experienced project
managers consider this to be a wise investment.
Appendix C

Reliability Testing Examples 


A great deal of work has been done by various researchers to
develop probabilistic methods suitable for reliability problems
(ref. C-1). Probabilistic methods that apply discrete and
continuous random variables to user problems are not as well
covered in the literature.

This appendix concentrates on four useful functions: (1)
failure $p(t)$, (2) reliability $R(t)$, (3) failure rate $\lambda$,
and (4) hazard rate $\lambda'$. Because we usually need to know how
well a point estimate has been defined, some consideration is given
to confidence intervals for these functions. The appendix also
explains methods for planning events at the critical delivery
milestone and closes with a brief explanation of two reliability
case histories.


Useful Distribution Functions 

The failure function $p(t)$, which defines failure as a function
of time or number of cycles, is important knowledge obtained
from reliability testing. Failure records are kept on a particular
piece of hardware to obtain a histogram of failures against time.
This histogram is studied to determine which failure distribution
fits the existing data best. Once a function $p(t)$ is obtained,
reliability analysis can proceed. In many cases, sufficient time
is not available to obtain large quantities of failure density
function data. In these cases, experience can be used to determine
which failure frequency function best fits a given set of
data. Table C-1 lists seven distributions, five continuous and
two discrete. These distributions can be used to describe the
time-to-failure functions for various components. The derivation
of the four reliability functions for the seven listed
distributions is explained in the next section (ref. C-2).

Derivation of $Q(t)$, $R(t)$, $\lambda$, and $\lambda'$ functions.—The
unreliability function $Q(t)$ is the probability that in a random
trial the random variable is not greater than $t$; hence,

$$Q(t) = \int_{-\infty}^{t} p(t)\,dt$$

When time is the variable, the usual range is 0 to $t$, implying
that the process operates for some finite time interval. This
integral is used to define the unreliability function when
failures are being considered.

The reliability function $R(t)$ is given by

$$R(t) = 1 - Q(t)$$

In integral form $R(t)$ is given by

$$R(t) = \int_{t}^{\infty} p(t)\,dt$$

Differentiation yields

$$\frac{dR(t)}{dt} = -\frac{dQ(t)}{dt} = -p(t)$$

The a posteriori probability of failure $p_f$ in a given time
interval, $t_1$ to $t_2$, can be calculated by using these equations
and is given by

$$p_f = \frac{1}{R(t_1)} \int_{t_1}^{t_2} p(t)\,dt$$
TABLE C-1.—FIT DATA FOR FAILURE FUNCTIONS

Distribution    Failure fit

Continuous distribution
Exponential     Complex electrical systems
Normal          Mechanical systems subject to wear
Weibull         Mechanical, electromechanical, or electrical
                parts: bearings, linkages with fatigue loads,
                relays, capacitors, and semiconductors.
                Reduces to exponential distribution if
                $\alpha = \bar{t}$, $\beta = 1$, and $\gamma = 0$
Gamma           Combined mechanical and electrical systems
Log normal      Mechanical parts under stress rupture loading

Discrete distribution
Poisson         One-shot parts
Binomial        Complex electrical systems for probability
                of $N_f$ defects


Substituting and simplifying gives

$$p_f = 1 - \frac{R(t_2)}{R(t_1)}$$

The rate at which failures occur in a time interval is defined
as the ratio of the probability of failure in the interval to the
interval length. Thus, the equation for failure rate $\lambda$ is
given by

$$\lambda = \frac{R(t_1) - R(t_2)}{(t_2 - t_1)R(t_1)} = \frac{1}{t_2 - t_1}\left[1 - \frac{R(t_2)}{R(t_1)}\right]$$

Substituting $t_1 = t$ and $t_2 = t + h$ into this equation gives

$$\lambda = \frac{R(t) - R(t+h)}{(t + h - t)R(t)} = \frac{R(t) - R(t+h)}{hR(t)}$$

The instantaneous failure rate in reliability literature is often
called the hazard rate. The hazard rate $\lambda'$ is by definition
the limit of the failure rate as $h \to 0$. Using a previous equation
and taking the limit of the failure rate as $h \to 0$ gives

$$\lambda' = \lim_{h \to 0} \frac{1}{R(t)}\left[\frac{R(t) - R(t+h)}{h}\right]$$

Letting $h = \Delta t$ in this equation gives

$$\lambda' = \lim_{\Delta t \to 0} -\frac{1}{R(t)}\left[\frac{R(t+\Delta t) - R(t)}{\Delta t}\right]$$

The term in brackets is recognized from the calculus to be the
derivative of $R(t)$ with respect to time, and the negative of this
derivative is equal to $p(t)$. Substituting these values gives

$$\lambda' = -\frac{1}{R(t)}\frac{dR(t)}{dt} = \frac{p(t)}{R(t)}$$

As an example, consider a jet airplane traveling from Cleveland
to Miami. This distance is about 1500 miles and could be
covered in about 2.5 hr. The average rate of speed would be
1500 miles divided by 2.5 hr, or 600 mph. The instantaneous
speed may have varied anywhere from 0 to 700 mph. The air
speed at any given instant could be determined by reading the
speed indicator in the cockpit. Replacing the distance continuum
by failures, failure rate is analogous to average speed,
600 mph in this example, and hazard rate is analogous to
instantaneous speed, the speed indicator reading in this example.

Figure C-1 presents a summary of the useful frequency
functions for the failure distributions given in table C-1. These
functions were derived by using the defining equations given
previously. Choose any failure function and verify that $R(t)$,
$\lambda$, and $\lambda'$ are properly defined by going through the
derivation yourself. Five reliability problems using the continuous
distributions given in figure C-1 are solved in the next section.

Estimation using the exponential, normal, Weibull, gamma,
and log normal distributions.—As an illustration of how to use
these equations for an electrical part that experience indicates
will follow the exponential distribution, consider example 1.

Example 1: Testing of a particular tantalum capacitor showed
that the failure density function was exponentially distributed.
For the 100 specimens tested, it was found that the mean time
between failures $\bar{t}$ was 1000 hr.

(1) What is the hazard rate?

(2) What is the failure rate at 100 hr and during the next
10-hr interval?

(3) What are the failure and reliability time functions?

Solution 1:

(1) Using the equations given in figure C-1 for exponential
distribution, the hazard rate is given by

$$\lambda' = \frac{1}{\bar{t}} = \frac{1}{1000\ \text{hr/failure}}$$

or

$$\lambda' = 1 \times 10^{-3}\ \text{failure/hr}$$
(2) The failure rate is given by

$$\lambda = \frac{1}{h}\left[1 - \frac{R(t_2)}{R(t_1)}\right]$$

For this case the time interval is given by

$$h = t_2 - t_1 = 110 - 100 = 10\ \text{hr}$$

The necessary reliability functions are given by

$$R(t_2) = e^{-t_2/\bar{t}} = e^{-110/1000} = e^{-0.11} = 0.896$$

and

$$R(t_1) = e^{-t_1/\bar{t}} = e^{-100/1000} = e^{-0.1} = 0.905$$

Substituting these values gives

$$\lambda = \frac{1}{10}\left(1 - \frac{0.896}{0.905}\right) = 1 \times 10^{-3}\ \text{failure/hr}$$

This is to be expected for the exponential case because the
failure rate is constant with time and is always equal to the
hazard rate.

(3) The failure and reliability time functions are given by

$$p(t) = \frac{1}{1000}\,e^{-t/1000}$$

$$R(t) = e^{-t/1000}$$

TABLE C-2.—TEST DATA FOR GIMBAL ACTUATORS

Ordered sample    Time to failure,      Time to failure squared,
number            $t_i$, hr             $t_i^2$, (10^3 hr)^2

 1                60 x 10^3             3600
 2                65                    4225
 3                68                    4624
 4                70                    4900
 5                75                    5625
 6                75                    5625
 7                80                    6400
 8                83                    6889
 9                85                    7225
10                90                    8100

Total             750 x 10^3            57 213
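The arithmetic of example 1 can be checked with a short script. The sketch below is only an illustration under the exponential assumption of that example; the function names `reliability` and `interval_failure_rate` are hypothetical helpers introduced here, not part of the original text. Without rounding the reliabilities to three figures, the interval failure rate comes out near 9.95 x 10^-4 failure/hr, which rounds to the hazard rate of 1 x 10^-3 failure/hr.

```python
import math


def reliability(t, mtbf):
    """Exponential reliability R(t) = exp(-t / mtbf)."""
    return math.exp(-t / mtbf)


def interval_failure_rate(t1, t2, mtbf):
    """Failure rate over [t1, t2]: (1/h) * (1 - R(t2)/R(t1))."""
    h = t2 - t1
    return (1.0 - reliability(t2, mtbf) / reliability(t1, mtbf)) / h


mtbf = 1000.0                      # mean time between failures, hr (example 1)
hazard = 1.0 / mtbf                # hazard rate, failure/hr
lam = interval_failure_rate(100.0, 110.0, mtbf)

print(f"hazard rate  = {hazard:.3e} failure/hr")
print(f"failure rate = {lam:.3e} failure/hr")
```

For the exponential case the interval rate does not depend on where the interval starts, so calling `interval_failure_rate(0.0, 10.0, mtbf)` gives the same value.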

As an illustration of how to use the equations given in figure
C-1 for mechanical parts subject to wear using the normal
distribution, consider example 2.

Example 2: A gimbal actuator is being used where friction,
mechanical loading, and temperature are the principal
failure-causing stresses. Assume that tests to failure have been
conducted on the mechanical parts, resulting in the data shown in
table C-2.

(1) What is the mean time between failures and the standard
deviation?

(2) What are the hazard rate at 85 300 hr and the failure rate
during the next 10 300-hr interval?

(3) What are the failure and reliability time functions?

Solution 2:

(1) The mean time between failures is given by

$$\bar{t} = \frac{\sum_{i=1}^{n} t_i}{n}$$

where

$\bar{t}$   mean time between failures, hr
$t_i$       time to failure, hr
$n$         number of observations

Therefore, using the data from table C-2,

$$\bar{t} = \frac{750\,000}{10} = 75\,000\ \text{hr}$$

The unbiased standard deviation $\sigma$ is given by

$$\sigma = \left[\frac{\sum_{i=1}^{n} t_i^2 - \dfrac{\left(\sum_{i=1}^{n} t_i\right)^2}{n}}{n-1}\right]^{1/2}$$

The sum terms required for this calculation are given by

$$\sum_{i=1}^{n} t_i^2 = 57\,213\ (10^3\ \text{hr})^2 \quad \text{(column 3, table C-2)}$$
Figure C-1.—Summary of useful frequency functions.

Columns: failure rate $\lambda$, hazard rate $\lambda'$, remarks.
For each continuous distribution the failure rate has the form
$\lambda = \frac{1}{h}\left[1 - \frac{R(t_2)}{R(t_1)}\right]$ with
$h = t_2 - t_1$.

Exponential:  Complex electrical systems.

Normal:       $\lambda'$ = normal ordinate at $t$ divided by normal
              area from $t$ to $\infty$. Mechanical systems.

Weibull:      $\alpha$ = scale parameter, $\beta$ = shape parameter,
              $\gamma$ = location parameter. Mechanical or electrical
              systems. If $\alpha = \bar{t}$, $\beta = 1$, and
              $\gamma = 0$, reduces to exponential. If $\beta = 3.5$,
              approximates normal.

Gamma:        $\lambda'$ = gamma ordinate at $t$ divided by gamma
              area from $t$ to $\infty$. Same as Weibull parameters
              but may be harder to use.
              $\Gamma(\beta) = \int_0^{\infty} t^{\beta-1} e^{-t}\,dt$;
              $\Gamma(\beta) = (\beta - 1)\Gamma(\beta - 1)$.
              Combined mechanical and electrical systems.

Log normal:   $\lambda'$ = log normal ordinate at $t$ divided by log
              normal area from $t$ to $\infty$. Mechanical parts that
              fail due to some wearout mechanism.

Poisson:      $\lambda$ and $\lambda'$ not applicable.
              $N_f$ = number of failures. One-shot devices.

Binomial:     $\lambda$ and $\lambda'$ not applicable.
              $p$ = defectives, $q$ = effectives, $n$ = trials
              (sample size). Complex systems for probability of
              $N_f$ defects.

Figure C-1.—Concluded.
and

$$\left(\sum_{i=1}^{n} t_i\right)^2 = (750)^2 = 562\,500\ (10^3\ \text{hr})^2$$

so that

$$\sigma = \left(\frac{57\,213 - 56\,250}{9}\right)^{1/2} \times 10^3\ \text{hr} = \left(\frac{963}{9}\right)^{1/2} \times 10^3\ \text{hr} = 10\,300\ \text{hr}$$

(2) The hazard rate $\lambda'$ is given by

$$\lambda' = \frac{\text{Scaled ordinate at 85 300 hr}}{\text{Normal area from 85 300 hr to } \infty}$$

Let $Y_1$ be the normal ordinate at 85 300 hr and $Z_1$ be the
standardized normal variable, which is given by

$$Z_1 = \frac{t - \bar{t}}{\sigma} = \frac{(85\,300 - 75\,000)\ \text{hr}}{10\,300\ \text{hr}} = 1.0$$

Existing tables for the normal ordinate values for $Z = 1.0$ give
$Y_1 = 0.242$. The scale constant $K_s$ to modify this ordinate value
for this problem is given by (ref. C-3)

$$K_s = \frac{n\theta}{\sigma}$$

where $\theta$ is the class interval. Substituting values and solving
for the scaled ordinate gives

$$K_s Y_1 = \frac{10 \times 1\ \text{failures}}{10\,300\ \text{hr}} \times 0.242 = 2.35 \times 10^{-4}\ \text{failure/hr}$$

Note that the denominator required to calculate $\lambda'$ is
$R(t_1)$, which is the normal area from 85 300 hr to $\infty$.
Existing tables for the normal area for $Z_1 = 1.0$ (ref. C-3) give
the area from $-\infty$ to $Z_1$, so that the unreliability $Q(t_1)$
is given by

$$Q(t_1) = 0.841 \quad (\text{area from } -\infty \text{ to } Z_1)$$

Because $Q(t_1) + R(t_1) = 1.000$,

$$R(t_1) = 1.000 - 0.841 = 0.159$$

and the hazard rate is given by

$$\lambda' = \frac{2.35 \times 10^{-4}\ \text{failure/hr}}{1.59 \times 10^{-1}} = 1.47 \times 10^{-3}\ \text{failure/hr}$$

The failure rate is given by

$$\lambda = \frac{1}{h}\left[1 - \frac{R(t_2)}{R(t_1)}\right]$$

In this case $h$ is given as 10 300 hr. The reliability at 95 600 hr
is given by

$$R(t_2) = \text{Normal area from 95 600 hr to } \infty$$

Using the preceding procedure results in

$$R(t_2) = 0.023$$

Substituting values gives

$$\lambda = \frac{1}{10\,300\ \text{hr}}\left(1 - \frac{0.023}{0.159}\right) = \frac{8.56 \times 10^{-1}}{1.03 \times 10^{4}} = 8.31 \times 10^{-5}\ \text{failure/hr}$$

(3) The constants required to write expressions for $p(t)$ and
$R(t)$ are calculated as follows:

$$\frac{1}{\sigma(2\pi)^{1/2}} = \frac{1}{(1.03 \times 10^{4})(2.51)} = 3.87 \times 10^{-5}$$

$$2\sigma^2 = 2 \times (1.03 \times 10^{4})^2 = 2.12 \times 10^{8}$$

Using the constants and substituting values gives

$$p(t) = 3.87 \times 10^{-5}\, e^{-(t - 7.5 \times 10^{4})^2 / 2.12 \times 10^{8}}$$

$$R(t) = 3.87 \times 10^{-5} \int_{t}^{\infty} e^{-(t - 7.5 \times 10^{4})^2 / 2.12 \times 10^{8}}\,dt$$
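The table lookups in example 2 can be reproduced numerically. The following is a minimal sketch that follows the procedure of the text, including its scale constant $n\theta/\sigma$ with $n = 10$ and $\theta = 1$; it starts from the totals stated in the text (750 and 57 213, in units of 10^3 hr) and uses the rounded standard deviation of 10 300 hr for the later steps. The helper names are hypothetical and only the Python standard library is assumed.

```python
import math


def normal_ordinate(z):
    """Standard normal ordinate (density) at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_upper_area(z):
    """Standard normal area from z to infinity."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


n = 10
sum_t = 750.0        # total time to failure, 10^3 hr (as stated in the text)
sum_t_sq = 57213.0   # total of squared times, (10^3 hr)^2

mean = sum_t / n * 1e3                                          # hr
sigma = math.sqrt((sum_t_sq - sum_t**2 / n) / (n - 1)) * 1e3    # hr

sigma_used = 10300.0   # rounded value carried through the text
theta = 1.0            # class interval used in the text
t1, h = 85300.0, 10300.0
z1 = (t1 - mean) / sigma_used
z2 = (t1 + h - mean) / sigma_used

r1 = normal_upper_area(z1)
r2 = normal_upper_area(z2)
scaled_ordinate = n * theta / sigma_used * normal_ordinate(z1)
hazard = scaled_ordinate / r1             # failure/hr, as defined in the text
failure_rate = (1.0 - r2 / r1) / h        # failure/hr

print(f"mean = {mean:.0f} hr, sigma = {sigma:.0f} hr")
print(f"R(t1) = {r1:.3f}, R(t2) = {r2:.3f}")
print(f"hazard rate  = {hazard:.2e} failure/hr")
print(f"failure rate = {failure_rate:.2e} failure/hr")
```

Small differences from the hand calculation come from the three-figure table values used in the text.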


As an illustration for the Weibull distribution, consider
example 3.

Example 3: A lot of 100 stepping motors was tested to see
what their reliability functions were. A power supply furnished
electrical pulses to each motor. Instrumentation recorded the
number of continuous steps a motor made before it failed to step
even though a pulse was provided. All testing was stopped at
$1 \times 10^{6}$ steps. The step failure data are given in table C-3.

TABLE C-3.—WEIBULL DATA FOR STEPPING MOTORS

      Problem 3                         Problem 9
Number of     Cumulative    Scaled time   Median   5-Percent   95-Percent
steps to      number of     to failure,   rank     rank        rank
failure       failures      $t_i$

0.2 x 10^3     2            1              6.70     0.51       25.89
 .4            4            2             16.23     3.68       39.42
 .9            5            3             25.86     8.73       50.69
4.0           16            4             35.51    15.00       60.66
10.0          20            5             45.17    22.24       69.65
18.0          50            6             54.83    30.35       77.76
30.9          90            7             64.49    39.34       85.00
50.0          97            8             74.14    49.30       91.27


(1) Calculate the frequency functions.

(2) Plot the hazard rate function on log-log paper.

(3) What conclusions can be drawn from this graph?

Solution 3: Because there are 100 motors in this lot, the data
give ordered plotting positions suitable for plotting on Weibull
probability paper. Figure C-2 shows a plot of these data. From
the shape of the data in figure C-2, it appears as though two
straight lines are necessary to fit this failure density function.
This means that different frequency functions exist at different
times. These frequency functions are said to be separated by a
partition parameter $\delta$.

From figure C-2 the Weibull scale, shape, and location
parameters can be estimated by following these steps.

(1) Estimate the partition parameter $\delta$. This estimate can be
obtained directly from figure C-2. The two straight lines that
best fit the given data intersect at point f. Projecting this point
down to the abscissa gives a failure age of 10 000 cycles for
the partition parameter $\delta$.

(2) Estimate the location parameter $\gamma$. This parameter is used
as a straightener for $p(t)$. Because $p(t - 0)$ is already a straight
line for both regions, it is clear that $\gamma_1 = \gamma_2 = 0$.
In general, several tries at straightening may be required before
the one yielding a straight line for $p(t - \gamma)$ is found.

(3) Estimate the shape parameter $\beta$. The intercept point a
for line b, drawn parallel to line c and passing through point d,
where $\ln(t - \gamma) = 1$, is equal to $\beta$. Thus,
$\beta_1 = 0.75$ and $\beta_2 = 1.50$.

(4) Estimate the scale parameter $\alpha$. At point e for line c,

$$\ln \alpha = -\ln \ln \frac{1}{1 - Q(t)}$$

so that

$$\alpha = \exp\left[-\ln \ln \frac{1}{1 - Q(t)}\right]$$

Therefore,

$$\alpha_1 = e^{2.75} = 15.7$$

$$\alpha_2 = e^{4.6} = 100$$

By using the parameters just estimated and the equations given
in figure C-1 for the Weibull distribution, the following failure
frequency functions can be expressed: The partition limits on
the number of steps $c$ are $0 < c < 10$ and $c > 10$. The frequency
functions are given by

$$p(c) = \frac{\beta}{\alpha}(c - \gamma)^{\beta - 1}\, e^{-(c - \gamma)^{\beta}/\alpha}$$

Substituting values results in

$$p_1(c) = \frac{0.75}{15.7}\, c^{0.75 - 1}\, e^{-c^{0.75}/15.7}$$

or

$$p_1(c) = 0.047\, c^{-0.25}\, e^{-c^{0.75}/15.7} \quad \text{for } 0 < c < 10$$

Similarly,

$$p_2(c) = 0.015\, c^{0.5}\, e^{-c^{1.50}/100} \quad \text{for } c > 10$$

The reliability functions are given by

$$R(c) = e^{-(c - \gamma)^{\beta}/\alpha}$$
Figure C-2.—Weibull plot for stepping motors. (Abscissa: log_e (failure age).)


Therefore, substituting values gives

$$R_1(c) = e^{-c^{0.75}/15.7} \quad \text{for } 0 < c < 10$$

and

$$R_2(c) = e^{-c^{1.5}/100} \quad \text{for } c > 10$$

The failure rate functions are given by

$$\lambda = \frac{1}{h}\left[1 - \frac{e^{-(c_2 - \gamma)^{\beta}/\alpha}}{e^{-(c_1 - \gamma)^{\beta}/\alpha}}\right]$$

Therefore, substituting values gives

$$\lambda_1 = \frac{1}{h}\left[1 - \frac{e^{-c_2^{0.75}/15.7}}{e^{-c_1^{0.75}/15.7}}\right] \quad \text{for } 0 < c < 10$$

and

$$\lambda_2 = \frac{1}{h}\left[1 - \frac{e^{-c_2^{1.5}/100}}{e^{-c_1^{1.5}/100}}\right] \quad \text{for } c > 10$$

The hazard rate functions are given by

$$\lambda' = \frac{\beta}{\alpha}(c - \gamma)^{\beta - 1}$$

Therefore, substituting values gives

$$\lambda'_1 = 0.047\, c^{-0.25} \quad \text{for } 0 < c < 10$$

and

$$\lambda'_2 = 0.015\, c^{0.5} \quad \text{for } c > 10$$

(2) By using two-cycle log-log paper and the following
calculation method, a graph of $\lambda'$ against $c$ can be obtained:

$$\lambda'_1 = 0.047\, c^{-0.25}$$

Taking logarithms to the base 10 gives

$$\log \lambda'_1 = \log 0.047 + (-0.25) \log c$$

Useful corollary equations are

$$10^x = Y$$

$$x = \log Y$$

$$10^0 = 1$$

and

$$\log 0.047 = \log (4.7 \times 10^{-2}) = \log 4.7 + (-2)\log 10 = \bar{2}.672, \text{ or } 8.672 - 10$$

For $c = 1$,

$$\log \lambda'_1 = \log 0.047 + (-0.25) \log 1$$

$$\lambda'_1 = 0.047$$

For $c = 10$,

$$\log \lambda'_1 = \log 0.047 + (-0.25) \log 10 = \bar{2}.672 - 0.25 = \bar{2}.422$$

$$\lambda'_1 = 0.0264$$

In a similar manner solving for $\lambda'_2$ gives the data points
shown in table C-4. These data are plotted in figure C-3.


TABLE C-4.—HAZARD RATE DATA FOR STEPPING MOTORS

Number of steps,    Failures per cycle,
$c$                 $\lambda'$

1 x 10^3            0.047
10                   .026
10                   .015
100                  .150


(3) Figure C-3 indicates that the hazard rate is decreasing by 
0.25 during the first interval and is increasing by 0.50 during 
the second interval for each logarithmic unit change of c. 
It appears that step motors, for first misses, jump from the 
“infant mortality” stage into the wearout stage without any 
transition period of random failures with a constant failure rate 
(ref. C-4). 
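The two-segment Weibull model of example 3 can be evaluated directly instead of through logarithm tables. The sketch below assumes the parameterization used in this appendix, in which the scale parameter divides the powered term, $R(c) = e^{-(c-\gamma)^{\beta}/\alpha}$, with $c$ in thousands of steps and the partition at $c = 10$. The function names and the `SEGMENTS` table are hypothetical, introduced only for illustration.

```python
import math

# (alpha, beta) estimated from the Weibull plot for each segment; gamma = 0.
SEGMENTS = {"early": (15.7, 0.75), "late": (100.0, 1.50)}
PARTITION = 10.0  # partition parameter, thousands of steps


def segment_for(c):
    """Pick the (alpha, beta) pair that applies at failure age c."""
    return SEGMENTS["early"] if c < PARTITION else SEGMENTS["late"]


def weibull_pdf(c, alpha, beta):
    """Failure frequency p(c) = (beta/alpha) c^(beta-1) exp(-c^beta / alpha)."""
    return (beta / alpha) * c ** (beta - 1.0) * math.exp(-(c ** beta) / alpha)


def weibull_reliability(c, alpha, beta):
    """Reliability R(c) = exp(-c^beta / alpha)."""
    return math.exp(-(c ** beta) / alpha)


def hazard(c):
    """Hazard rate (beta/alpha) c^(beta-1) for the segment containing c."""
    alpha, beta = segment_for(c)
    return (beta / alpha) * c ** (beta - 1.0)


for c in (1.0, 9.99, 10.0, 100.0):
    print(f"c = {c:6.2f}  hazard = {hazard(c):.4f}")
```

On log-log axes the first segment has slope -0.25 and the second has slope +0.50, which is the pattern described in item (3) above.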

As an illustration of combined mechanical and electrical sys- 
tems that follow the gamma distribution, consider example 4: 



TABLE C-5.—ELECTRIC ROCKET RELIABILITY DATA

Ordered    Time to      Median    Scaled      Linear
sample     failure,     rank      time to     scale
number     hr                     failure     rank

 1          1 037.8      6.70      7.2         5.0
 2          1 814.4     16.23     12.6        15.0
 3          2 332.8     25.86     16.3        25.0
 4          3 124.8     35.51     21.7        35.0
 5          3 614.4     45.17     25.1        45.0
 6          4 579.2     54.83     31.8        55.0
 7          5 342.4     64.49     37.1        65.0
 8          6 292.8     74.14     43.7        75.0
 9          7 920.0     83.77     55.0        85.0
10         11 404.8     93.30     79.2        95.0
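The median ranks in tables C-3 and C-5 are normally read from published rank tables. As a hedged illustration, the sketch below computes them from the standard definition: the rank at confidence level $P$ for the $i$th ordered failure out of $n$ is the value $x$ for which the probability that at least $i$ of $n$ observations fall below $x$ equals $P$. The function names are hypothetical, and the bisection solver is just one simple way to invert the binomial sum.

```python
import math


def prob_at_least(i, n, x):
    """Probability that at least i of n observations fall below fraction x."""
    return sum(math.comb(n, k) * x**k * (1.0 - x) ** (n - k)
               for k in range(i, n + 1))


def rank(i, n, level=0.5):
    """Rank (as a fraction) of the i-th ordered failure at the given level.

    level=0.5 gives the median rank; 0.05 and 0.95 give the 5- and
    95-percent ranks. Solved by bisection, since prob_at_least increases
    monotonically with x.
    """
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if prob_at_least(i, n, mid) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


n = 10
for i in range(1, n + 1):
    print(i,
          f"{100 * rank(i, n, 0.05):6.2f}",
          f"{100 * rank(i, n, 0.50):6.2f}",
          f"{100 * rank(i, n, 0.95):6.2f}")
```

For a sample of 10 this reproduces the tabulated first-position values of 0.51, 6.70, and 25.89 percent.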


Example 4: Environmental testing of 10 electric rockets with
associated power conditioning has resulted in the ordered
time-to-failure data given in table C-5.

(1) What is the mean time between failures?

(2) Write the gamma failure and the reliability functions.

(3) What is the hazard rate at 5000 hr?

(4) What is the failure rate at 5000 hr during the next
1000-hour interval?

Solution 4: The essential steps for the graphical solution of
this problem follow (ref. C-5):

(1) Obtain the median ranks for each ordered position; see
table C-5.

(2) Plot on linear graph paper (10 x 10 to the inch) median
rank against time to failure for the range around 80-percent
median rank.

(3) Fit a straight line to the plotted points. For a median
rank of 80 read the corresponding time to failure $t_{80}$ in hours.
Figure C-4 gives a $t_{80}$ of 7200 hr.
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With these graphical construction aids, the solution to the 
problem is readily achieved: 

(1) The mean time between failures is given by 

f =ap = 2.4xl0 3 hr x 2.25 = 5.4 x 10 3 hr 

(2) The gamma failure and reliability functions are given by 




It has been shown that y = 0; the other constants are calculated 
as follows: 


Figure C-4. — Electric rocket life. 

(4) The time-to-failure data are scaled by using the equation 



where 

*i / th scaled time to failure 

/ 80 rough estimate of 80-percent failure time 

t i i th time to failure, hr 

Table C-5 gives f- for each ordered sample. 

(5) Plot on linear graph paper (10 x 10 to the inch) median 
rank against scaled time to failure t h Figure C-5 shows the 
plotted data points for this problem. 

(6) These data points fit the gamma curve well with a $\beta$
estimate of 2.0; hence, it appears as though a two-parameter
gamma distribution is required with the location parameter $\gamma$
equal to zero. The nonzero location parameter case is covered
in the literature (ref. C-5).

(7) Overlay the linear axis (10 spaces to the inch) of a sheet
of five-cycle semilog paper corresponding to a $\beta$ of 2.0. Plot on
this special graph paper the linear scale rank against time-to-failure
data given in table C-5.

(8) Fit a straight line through the plotted points. Figure C-6
shows the plot for these data. Two additional straight lines are
shown in this figure: line 1 was obtained by plotting two known
points (0.5, 1) and (20, 8) (ref. C-5); line 2 has one point at
(0.5, 1) with a slope $m$. If line 1 were coincident with line 2, the
$\beta$ estimate would be sufficiently accurate.

(9) Because the two lines are not coincident, a closer approximation
for $\beta$ is obtained by taking a new midpoint coordinate
estimate of 6.8 from figure C-6. Using existing charts gives
$\beta = 2.25$, which satisfies the slope criteria (ref. C-5).

(10) For a shape parameter $\beta$ of 2.25, a linear scale rank of
20 percent applies. Entering figure C-6 at this point on the
ordinate gives a scale parameter $\alpha$ of 2400 hr.


$\alpha^{\beta} = (2.4\times10^{3})^{2.25}$

Using logarithms, $\log \alpha^{\beta} = 2.25(\log 2.4 + \log 10^{3})$; performing
the indicated operations gives $\log \alpha^{\beta} = 7.63$; hence
$\alpha^{\beta} = 4.25\times10^{7}$.

The second required constant is $\Gamma(\beta) = \Gamma(2.25)$. Using the
identity $\Gamma(x+1) = x!$, then $\Gamma(2.25) = \Gamma(1.25+1) = 1.25!$. Using
Stirling's formula, $x! = x^{x} e^{-x} (2\pi x)^{1/2}$. Taking logarithms gives

$\log x! = x \log x + (-x) \log e + \tfrac{1}{2}\,[\log 2\pi + \log x]$

$\qquad = \left(x + \tfrac{1}{2}\right) \log x - 0.434x + 0.399$

$\log(1.25!) = 1.75 \log 1.25 - 0.434 \times 1.25 + 0.399 = 0.026$

Substituting and forming the product gives $\alpha^{\beta}\,\Gamma(\beta) = (4.25\times10^{7})
\times 1.06 = 4.5\times10^{7}$. Using these constants and substituting values
gives

$p(t) = \dfrac{t^{1.25}\, e^{-t/2.4\times10^{3}}}{4.5\times10^{7}}$

and

$R(t) = \dfrac{1}{4.5\times10^{7}} \displaystyle\int_{t}^{\infty} t^{1.25}\, e^{-t/2.4\times10^{3}}\, dt$

(3) The hazard rate function at 5000 hr is given by

$\lambda' = \dfrac{p(t_1)}{R(t_1)}$

Here

$p(t_1) = \dfrac{(5\times10^{3})^{1.25}\, e^{-5\times10^{3}/2.4\times10^{3}}}{4.5\times10^{7}}$


TABLE C-6— TEST DATA FOR GUY SUPPORTS 


Performing the indicated operations gives

$p(t_1) = \dfrac{(4.21\times10^{4}) \times (1.25\times10^{-1})}{4.5\times10^{7}} = 1.17\times10^{-4}$


We can obtain $R(t_1)$ either analytically by using this integral
equation or graphically from figure C-6. Enter figure C-6 at a
failure age of 5000 hr. Draw a vertical line to line 3. Project the
intersection of $f(t)$ and 5000 hr over to the linear scale rank
(0.605). Using a previous identity,

$R(t_1) = 1 - 0.605 = 0.395$

Substituting values gives

$\lambda' = \dfrac{1.17\times10^{-4}}{3.95\times10^{-1}} = 2.96\times10^{-4}$ failure/hr


(4) The failure rate function at 5000 hr during the next
1000-hr interval is given by

$\lambda = \dfrac{1}{t_2 - t_1}\left[1 - \dfrac{R(t_2)}{R(t_1)}\right]$

Following the procedure given previously and substituting
values gives

$R(t_2) = 1 - 0.710 = 0.290$

$\lambda = \dfrac{1}{10^{3}}\left(1 - \dfrac{0.290}{0.395}\right) = 2.65\times10^{-4}$ failure/hr
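The gamma calculations of example 4 can also be checked numerically. The following is a minimal sketch, assuming only the fitted constants from the text ($\alpha = 2400$ hr, $\beta = 2.25$, $\gamma = 0$); the function names are illustrative and are not part of the original procedure. It evaluates the gamma function directly rather than through Stirling's formula and integrates the density with Simpson's rule, so its results can differ somewhat from values read off figure C-6.

```python
import math

ALPHA = 2400.0  # scale parameter, hr (from the text)
BETA = 2.25     # shape parameter (from the text)


def gamma_pdf(t, alpha=ALPHA, beta=BETA):
    """Two-parameter gamma failure density p(t), in failures per hour."""
    norm = alpha ** beta * math.gamma(beta)
    return t ** (beta - 1.0) * math.exp(-t / alpha) / norm


def gamma_reliability(t, alpha=ALPHA, beta=BETA, steps=2000):
    """R(t) = 1 - integral of p from 0 to t (composite Simpson's rule)."""
    if t <= 0.0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = gamma_pdf(0.0, alpha, beta) + gamma_pdf(t, alpha, beta)
    for k in range(1, steps):
        weight = 4.0 if k % 2 else 2.0
        total += weight * gamma_pdf(k * h, alpha, beta)
    return 1.0 - total * h / 3.0


mtbf = ALPHA * BETA                  # mean time between failures, hr
p1 = gamma_pdf(5000.0)               # failure density at 5000 hr
r1 = gamma_reliability(5000.0)       # reliability at 5000 hr
hazard = p1 / r1                     # hazard rate at 5000 hr

print(f"MTBF = {mtbf:.0f} hr, p = {p1:.3e}, R = {r1:.3f}, hazard = {hazard:.3e}")
```

The density agrees closely with the hand calculation. The reliability obtained by direct integration need not match the graphical reading, which depends on how the straight line was drawn through the plotted points.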

As an illustration of mechanical parts, consider example 5:
Example 5: A cable used as guy supports for sail experiments
in wind tunnel testing exhibited the time-to-failure performance
data given in table C-6.

(1) Write the failure and reliability functions. 

(2) What is the hazard rate at 5715 hr? 

(3) What is the failure rate during the next 3000 hr? 

Solution 5: 

(1) The essential steps for solving this problem are 


Ordered    Time to      Median    5-Percent    95-Percent
sample     failure,     rank      rank         rank
number     t_f, hr

   1         1 100        6.7        0.5         25.9
   2         1 890       16.2        3.7         39.4
   3         2 920       25.9        8.7         50.7
   4         4 100       35.5       15.0         60.7
   5         5 715       45.2       22.2         69.7
   6         8 720       54.8       30.3         77.8
   7        12 000       64.5       39.3         85.0
   8        17 500       74.1       49.3         91.3
   9        23 900       83.3       60.6         96.3
  10        46 020       93.3       74.1         99.5


(a) Obtain the median rank for each ordered position (see
table C-6).

(b) Plot median rank against time to failure on log-normal
probability graph paper (probability times two log cycles) as
shown in figure C-7.

(c) If a straight line can be fitted to these plotted points, the
time-to-failure function is log normal.

(d) The mean time between failures is calculated by $\bar{t}' =
\ln(\bar{t})$, where $\bar{t} = 6970$ hr as shown in figure C-7 for a median
rank of 50 percent; hence $\bar{t}' = 8.84$.

(e) The standard deviation is given by

$\sigma_{t'} = \dfrac{\ln t_u - \ln t_l}{3}$

where $t_u = 49\,500$ hr and $t_l = 1020$ hr as shown in figure C-7
for a median rank and a 1 - rank of 93.3 percent; hence $\sigma_{t'} =
(10.81 - 6.93)/3 = 1.28$.

With these constants the expressions for $p(t)$ and $R(t)$ are
written as

$p(t) = \dfrac{3.21\times10^{-1}}{t}\, e^{-(t'-8.84)^{2}/3.28}$

$R(t) = 3.21\times10^{-1} \displaystyle\int_{t'}^{\infty} e^{-(t'-8.84)^{2}/3.28}\, dt'$

(2) The log-normal ordinate required for $\lambda'$ can be calculated
by using the standardized normal variable table as in
example 2. The log-normal standardized variable is given by

$Z_1 = \dfrac{t' - \bar{t}'}{\sigma_{t'}} = \dfrac{8.66 - 8.84}{1.28} = -0.14$

From the normal-curve ordinate tables

$Y_1 = 0.395$

and

$\dfrac{10 \times 0.395}{1.28} = 3.09$ failures

Substituting values gives

$p(t') = \dfrac{3.09}{5.715\times10^{3}} = 5.40\times10^{-4}$ failure/hr

The log-normal area from $t'$ to infinity can be obtained directly
from figure C-7 by using the 1 - rank scale. Enter the time-to-failure
ordinate at 5715 hr; project over to the log-normal life
function $f(t)$ and down to the 1 - rank abscissa value of 0.638.
Therefore, the hazard rate $\lambda'$ at 5715 hr is given by

$\lambda' = \dfrac{5.40\times10^{-4}}{6.38\times10^{-1}} = 8.46\times10^{-4}$ failure/hr

(3) The failure rate during the next 3000 hr is calculated by
knowing that $R(t_1) = 0.638$ at a time to failure of 5715 hr and
by obtaining $R(t_2) = 0.437$ from figure C-7 at 8715 hr. Therefore,
the failure rate is given by

$\lambda = \dfrac{1}{3\times10^{3}}\left(1 - \dfrac{0.437}{0.638}\right) = 1.05\times10^{-4}$ failure/hr

Figure C-7. — Guy support life. (Time to failure, hr, plotted against 1 - rank.)
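The interval failure rate used in examples 4 and 5 is the same expression applied to two reliability readings. A minimal sketch follows, assuming the reliabilities are supplied by the user (here, the values read from figures C-6 and C-7 in the text); the function name is illustrative.

```python
def interval_failure_rate(r_start, r_end, interval_hr):
    """Average failure rate over an interval: (1 - R(t2)/R(t1)) / (t2 - t1)."""
    if not 0.0 < r_end <= r_start <= 1.0:
        raise ValueError("require 0 < R(t2) <= R(t1) <= 1")
    if interval_hr <= 0.0:
        raise ValueError("interval must be positive")
    return (1.0 - r_end / r_start) / interval_hr


# Example 4 (gamma): R(5000 hr) = 0.395, R(6000 hr) = 0.290
rocket_rate = interval_failure_rate(0.395, 0.290, 1000.0)

# Example 5 (log normal): R(5715 hr) = 0.638, R(8715 hr) = 0.437
guy_support_rate = interval_failure_rate(0.638, 0.437, 3000.0)

print(f"{rocket_rate:.3e} failure/hr, {guy_support_rate:.3e} failure/hr")
```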


$\bar{t} = \dfrac{T}{r} = \dfrac{15\,000\ \text{hr}}{15\ \text{failures}} = 1000$ hr/failure


Determination of confidence limits . — In the preceding sec- 
tions, statistical estimates of various parameters have been 
made. Here we determine the methods for defining the confi- 
dence to be placed in some of these estimates. In example 1, 
tantalum capacitors with a one-parameter exponential distribu- 
tion were studied. For an exponentially distributed population, 
additional estimates follow the chi-squared distribution. As an 
illustration of how to determine confidence limits for an expo- 
nentially distributed estimate, consider example 6. 

Example 6: One hundred tantalum capacitors were tested for 
15 000 hr, during which time 15 parts failed. 

(1) What is the mean time between failures?

(2) What are the upper and lower confidence limits at the
98-percent confidence level?


(2) The upper and lower confidence limits at some confidence
level are given by

$\mathrm{UCL} = \dfrac{2T}{\chi^{2}_{[1-(\alpha/2)];\,2r}}$

and

$\mathrm{LCL} = \dfrac{2T}{\chi^{2}_{(\alpha/2);\,2r}}$


where 


Solution 6: 

(1) The mean time between failures is given by 


UCL  upper confidence limit, hr

LCL  lower confidence limit, hr

$T$  total observed operating time, hr

$\chi^{2}$  percentage points of chi-squared distribution

$r$  number of failures

$1-\alpha$  probability that $\bar{t}$ will be in the calculated interval

For the 98-percent confidence level required by this problem,



Solution 7: For the areas under the normal curve from $-\infty$ to
$Z$ equal to 0.98 and 0.02, existing area tables give $Z = \pm 2.06$
and $r = 15 + 5 = 20$ total failures, with $2r = 40$.

Substituting values gives

$(2\chi^{2})^{1/2} = (2\times40 - 1)^{1/2} \pm 2.06$


$1 - \dfrac{\alpha}{2} = 0.99$

and

$2r = 30$

$\chi^{2}_{0.01;40} = 59.7$,  $\chi^{2}_{0.99;40} = 23.4$


Hence,

$\mathrm{UCL} = \dfrac{40\times10^{3}}{23.4} = 1709$ hr


Therefore, the chi-squared distribution values (available from
many existing tables) are given by

$\chi^{2}_{0.01;30} = 50.9$
$\chi^{2}_{0.99;30} = 14.9$

Substituting values gives

$\mathrm{UCL} = \dfrac{30\times1000}{14.9} = 2013$ hr

and

$\mathrm{LCL} = \dfrac{30\times1000}{50.9} = 589$ hr


Thus, it is known with 98-percent confidence that the mean time
between failures $\bar{t}$ lies between approximately 590 and 2010 hr.
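The limit calculation of example 6 reduces to two divisions once the chi-squared percentage points have been looked up. The sketch below assumes the table values quoted in the text (14.9 and 50.9 for 30 degrees of freedom) and the text's total operating time of 15 000 hr; the function and argument names are illustrative.

```python
def mtbf_confidence_limits(total_time_hr, chi2_small, chi2_large):
    """Return (LCL, UCL) for an exponential MTBF estimate.

    chi2_small and chi2_large are the two tabulated chi-squared
    percentage points for 2r degrees of freedom; dividing 2T by the
    larger point gives the lower limit and by the smaller point the
    upper limit.
    """
    if not 0.0 < chi2_small < chi2_large:
        raise ValueError("require 0 < chi2_small < chi2_large")
    two_t = 2.0 * total_time_hr
    return two_t / chi2_large, two_t / chi2_small


lcl, ucl = mtbf_confidence_limits(15000.0, chi2_small=14.9, chi2_large=50.9)
print(f"LCL = {lcl:.0f} hr, UCL = {ucl:.0f} hr")
```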

Determining the percentage values for the chi-squared distribution
for values of $r$ greater than 30 may also be useful. It has
been shown that when $r > 30$,

$(2\chi^{2})^{1/2} = [2(2r) - 1]^{1/2} \pm Z$

$\mathrm{LCL} = \dfrac{40\times10^{3}}{59.7} = 670$ hr


Thus, it can be said with 98-percent confidence that $\bar{t}$ lies
between approximately 670 and 1710 hr; as the test time
increases, the estimated-parameter confidence interval decreases.
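The large-sample approximation used in example 7 can be written as a short function. This is a sketch under the assumptions stated in the text: $2r = 40$ degrees of freedom, $Z = 2.06$ as read from the area tables, and $2T = 40\times10^{3}$ hr (that is, $T = 20\,000$ hr). The function name is illustrative.

```python
import math


def chi2_percentage_points(dof, z):
    """Approximate chi-squared points from (2*chi2)**0.5 = (2*dof - 1)**0.5 +/- z."""
    root = math.sqrt(2.0 * dof - 1.0)
    return 0.5 * (root - z) ** 2, 0.5 * (root + z) ** 2


chi2_low, chi2_high = chi2_percentage_points(dof=40, z=2.06)

total_time_hr = 20000.0
lcl = 2.0 * total_time_hr / chi2_high
ucl = 2.0 * total_time_hr / chi2_low

print(f"chi2 = {chi2_low:.1f}, {chi2_high:.1f}; LCL = {lcl:.0f} hr, UCL = {ucl:.0f} hr")
```

The computed points land close to the 23.4 and 59.7 quoted in the text; small differences of this kind are expected from rounding in the hand calculation.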

In example 2 gimbal actuators that exhibited normally dis- 
tributed time-to-failure data were analyzed. For a normally 
distributed population, additional mean estimates will also be 
normal. As an illustration of how to determine confidence 
intervals for normal estimates, consider example 8. 

Example 8: Twenty-five gimbal actuators have been tested.
The mean time between failures has been calculated to be
75 000 hr with a standard deviation of 10 300 hr (see
example 2). What are the upper and lower confidence limits at
a 90-percent confidence level?

Solution 8: The upper and lower confidence limits are given by

$\mathrm{UCL} = \bar{t} + K_{\alpha/2}\,\dfrac{\sigma}{n^{1/2}}$

$\mathrm{LCL} = \bar{t} - K_{\alpha/2}\,\dfrac{\sigma}{n^{1/2}}$


where 


where Z is the area under the normal curve at the specified 
confidence level. Example 7 illustrates how this equation is 
used for confidence interval calculations. 

Example 7: The tantalum capacitors of example 6 have been 
operated for 5000 more hr; five additional units have failed. 
What are the confidence limits on t at the 98-percent confi- 
dence level for this additional testing? 


$\bar{t}$  mean time between failures, hr

$K_{\alpha/2}$  standardized normal variable
$\sigma$  unbiased standard deviation

$n$  number of samples

$1-\alpha$  probability that $\bar{t}$ will be in calculated interval
For this problem
$\mathrm{LCL} = 75\,000 - \dfrac{1.83\times9820}{10^{1/2}} = 69\,300$ hr

$1 - \alpha = 0.90$

and

$\alpha = 0.10$


$\dfrac{\alpha}{2} = 0.05$

and from existing tables for the area under the normal curve,
$K_{\alpha/2} = 1.64$. Substituting values gives

$\mathrm{UCL} = 75\,000 + \dfrac{1.64\times10\,300}{25^{1/2}} = 78\,400$ hr

and

$\mathrm{LCL} = 75\,000 - \dfrac{1.64\times10\,300}{25^{1/2}} = 71\,600$ hr


This means that 90 percent of the time the mean-time-between-failures
estimate $\bar{t}$ for 25 gimbal actuators, rather than the
original 10, will be between 71 600 and 78 400 hr. Note that the
sample size $n$ has been increased to use this technique. This
reflects the usual user pressure to learn as much as possible with
the least amount of testing. Try to keep $n \ge 25$ in estimating
normal parameters with this technique. If $n < 25$, use Student's
$t$ distribution (ref. C-6). To determine the effects on confidence
intervals of reducing sample size, rework example 2 for the
smaller sample size of 10, using Student's $t$ distribution. The
upper and lower confidence limits are given by


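$\mathrm{UCL} = \bar{t} + t_{\alpha/2}\,\dfrac{s}{n^{1/2}}$

and

$\mathrm{LCL} = \bar{t} - t_{\alpha/2}\,\dfrac{s}{n^{1/2}}$

where

$t_{\alpha/2}$  Student's $t$ variable
$s$  standard deviation

For this problem, $r = n - 1 = 9$, $\alpha = 0.10$, and from existing
tables $t_{\alpha/2}$ is 1.83. The standard deviation is given by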


Comparing this time interval with that calculated for a 
sample size of 25 shows that the smaller sample gives a larger 
interval of uncertainty. 
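The two interval calculations differ only in the multiplier applied to the standard error. A minimal sketch follows, assuming the values given in the text; the normal multiplier is computed with Python's `statistics.NormalDist`, while the Student's $t$ value of 1.83 is taken from the text's table lookup rather than computed. The function name is illustrative.

```python
import math
from statistics import NormalDist


def mean_confidence_limits(mean, std_dev, n, multiplier):
    """Return (LCL, UCL) = mean -/+ multiplier * std_dev / sqrt(n)."""
    half_width = multiplier * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


# 90-percent two-sided interval: alpha/2 = 0.05 in each tail.
k_normal = NormalDist().inv_cdf(0.95)  # about 1.64

large_sample = mean_confidence_limits(75000.0, 10300.0, 25, k_normal)
small_sample = mean_confidence_limits(75000.0, 9820.0, 10, 1.83)  # t from tables, 9 d.f.

print("n = 25:", [round(x) for x in large_sample])
print("n = 10:", [round(x) for x in small_sample])
```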

In example 3 stepping motors that exhibited Weibull distrib- 
uted time-to-failure data were studied. As a graphical illustra- 
tion of how to determine confidence intervals for a 
Weibull-distributed estimate, consider example 9. 

Example 9: Another group of stepping motors has been step 
tested as previously explained in example 3. The Weibull plot 
of percent failures for a given failure age is the same as that 
given in figure C-2. During this testing, however, only eight 
failures have occurred. What is the 90-percent confidence band 
on the reliability estimate at 4000 cycles? 

Solution 9: The data needed for graphical construction of the
confidence lines on the Weibull plot are given in table C-3. The
following steps are necessary to construct the confidence lines
in figure C-2:

(1) Enter the percent failure axis with the first 5-percent rank
value hitting $f(t)$; for failure 2 the 5-percent rank is 3.68.

(2) Draw a horizontal line that intersects $f(t)$ at point 1.

(3) Draw a vertical line to cross the corresponding median
rank; for failure 2 the median rank is 16.23.

(4) Draw a horizontal line at the median rank, 16.23, for
failure 2. The intersection point of the line for step (3) with this
line is one point on the 95-percent confidence line.

(5) Repeat steps (1) to (4) until the desired cycle life is
covered, 4000 cycles in this case.

(6) The 5-percent confidence line is obtained in a similar
manner. Enter the percent failure axis with the 95-percent
failure rank; 25.89 for failure 1.

(7) Draw a horizontal line that intersects $f(t)$ at point 3.

(8) Draw a vertical line to cross the corresponding median
rank; 6.70 for failure 1.

(9) Draw a horizontal line at the median rank, 6.70, for failure 1.
The intersection point of these two lines is one point on
the 5-percent confidence line.

(10) Repeat steps (6) to (9) until the desired cycle life is
covered.


$s = \left(\dfrac{57\,213 - 56\,250}{10}\right)^{1/2} = 9820$

Substituting values gives

$\mathrm{UCL} = 75\,000 + \dfrac{1.83\times9820}{10^{1/2}} = 80\,700$ hr


A 90-percent confidence interval for $f(t)$ at 4000 cycles is,
from figure C-2, 1.2 to 37.5 percent. Hence, a 90-percent
confidence interval for $R(t)$ at 4000 cycles is 0.998 to 0.625.

In example 5, guy supports that exhibited log-normally-distributed
time-to-failure data were analyzed. As a final graphical
illustration of how to determine confidence intervals for a
log-normally-distributed estimate, consider example 10.
Example 10: It has been shown that the guy supports of
example 5 exhibited a reliability of 0.638 at a time to failure of
5715 hr. Consider now the procedure for determining the
confidence band on this log-normal estimate. The data needed
for the graphical construction of the 90-percent confidence
lines on the log-normal graph of figure C-7 are also given in
table C-6.

Solution 10: The steps necessary to graphically construct the
confidence lines in figure C-7 are as follows:

(1) Enter the rank axis with the first 5-percent rank value
hitting $f(t)$, the log-normal life function shown in figure C-7;
for ordered sample 3, the 5-percent rank is 8.7.

(2) Draw a vertical line to intersect $f(t)$ at point 1 as shown in
figure C-7.

(3) Draw a horizontal line to cross the corresponding median
rank; for ordered sample 3, the median rank is 25.9.

(4) The intersection point (point 2 in fig. C-7) of step (3) and
the median-rank line is one point on the 95-percent confidence
line.

(5) Repeat steps (1) to (4) until the desired time to failure is
covered; 5715 hr in this case.

(6) The 5-percent confidence line is obtained in a similar
manner. Enter the rank axis with the 95-percent-failure rank,
25.9, for ordered sample 1.

(7) Draw a vertical line intersecting $f(t)$ at point 3.

(8) Draw a horizontal line to cross the corresponding median
rank; for ordered sample 1, the median rank is 6.7.

(9) The intersection point (point 4 in fig. C-7) of these two
lines is one point on the 5-percent confidence line.

(10) Repeat steps (6) to (9) until the desired time to failure is
covered.


TABLE C-7.— POISSON DATA
FOR SPEED CONTROLLER

Ordered     Time to
sample      failure,
number      t_f, hr

   1         3 520.0
   2         4 671.2
   3         6 729.3
   4         7 010.0
   5         8 510.2
   6         9 250.1
   7        10 910.0
   8        11 220.5
   9        11 815.6
  10        12 226.4

Total       85 866.3


(3) What is the probability that 6, 7, 8, 9, or 10 failures will 
occur? What is the reliability after the fifth failure? 

Solution 11:

(1) Reducing the data given in table C-7 gives the mean time
between failures as

$\bar{t} = \dfrac{8.59\times10^{4}}{10} = 8.59\times10^{3}$ hr/failure


At 5715 hr the 90-percent confidence interval for $f(t)$ is, from
figure C-7, 19.7 to 69.4 percent. Hence, a 90-percent confidence
interval for $R(t)$ at 5715 hr is 0.803 to 0.306. Incidentally,
this graphical procedure for finding confidence intervals is
completely general and can be used on other types of life test
diagrams.

Estimation using the Poisson and binomial events.— The
Poisson and binomial distributions are discrete functions of the
number of failures $N_f$ that occur rather than of the time $t$.

The Poisson distribution (fig. C-1) is a discrete function of
the number of failures. When this distribution applies, it is of
interest to determine the probabilities associated with a specified
number of failures in the time continuum. As an illustration
of a complex electrical component that follows the Poisson
distribution, consider example 11.

Example 11. Ten space-power speed controllers were tested 
during the rotating solar dynamic development program. The 
time-to-failure test data are given in table C-7. 


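Hence, the Poisson failure density function is given by

$p(N_f) = \dfrac{(t/8.59\times10^{3})^{N_f}\, e^{-t/8.59\times10^{3}}}{N_f!}$

The reliability function is given by

$R(N_f) = \displaystyle\sum_{j=N_f}^{10} \dfrac{(t/8.59\times10^{3})^{j}\, e^{-t/8.59\times10^{3}}}{j!}$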

(2) To calculate the probability of five failures in 10 000 hr, 
use the ratio 


(1) Write the Poisson failure density and reliability functions.

(2) What is the probability of five failures in 10 000 hr?


$\dfrac{t}{\bar{t}} = \dfrac{1.0\times10^{4}}{8.59\times10^{3}} = 1.16$
The probability of five failures in 10 000 hr is given by


TABLE C-8 — BINOMIAL 
EXPANSION COEFFICIENTS 


$p(5) = \dfrac{(1.16)^{5}\, e^{-1.16}}{5!} = \dfrac{2.09\times0.314}{1.2\times10^{2}} = 5.47\times10^{-3}$


One easy method of calculating the term $(1.16)^{5}$ is to use natural logarithms:

$\ln (1.16)^{5} = 5 \ln 1.16 = 5(0.148) = 0.740$


Sample    Possible     Binomial
size,     number of    expansion
n         failures     coefficients

1         2            1 1
2         3            1 2 1
3         4            1 3 3 1
4         5            1 4 6 4 1


$(1.16)^{5} = 2.09$

(3) The reliability from the 5th to the 10th failure is the sum
of the remaining terms in the Poisson expansion. The Poisson
expansion in sum form is given by

$R(N_f) = \displaystyle\sum_{j=6}^{10} \dfrac{0.314\,(1.16)^{j}}{j!}$


Calculating each term and summing gives 


$R(N_f) = \displaystyle\sum_{j=N_f}^{4} \dfrac{4!}{(4-j)!\,j!}\, p^{j} q^{4-j}$


One simple method for obtaining the binomial expansion 
coefficients is to make use of Pascal’s triangle. Pascal found 
that there was symmetry to the coefficient development and 
explained it as shown in table C-8. Pascal’s triangle (dashed 
lines) is shown in the last column. The lower number in the 
dashed triangle is obtained by adding the two upper numbers 
(i.e., 3 + 3 = 6). 

Using these constants and expanding gives $p(N_f)$ as


R(6) = 0.0013 
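The Poisson terms of example 11 are easy to evaluate directly. The sketch below assumes the ratio $t/\bar{t} = 1.16$ from the text and sums the 6th through 10th terms; the function name is illustrative.

```python
import math


def poisson_term(k, m):
    """Probability of exactly k failures when the expected number is m = t / MTBF."""
    return m ** k * math.exp(-m) / math.factorial(k)


m = 1.16  # ratio t / t_bar for t = 10 000 hr (from the text)

p5 = poisson_term(5, m)                                # exactly five failures
r6 = sum(poisson_term(j, m) for j in range(6, 11))     # 6 through 10 failures

print(f"p(5) = {p5:.2e}, R(6) = {r6:.4f}")
```

Working with the unrounded power and exponential gives a value for $p(5)$ slightly above the hand-calculated $5.47\times10^{-3}$; the difference comes from rounding $(1.16)^{5}$ to 2.09 in the text.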

The binomial distribution is given in figure C-1 as distribution 7.
Considerable work has been done to develop the techniques
suitable for using this powerful tool (refs. C-1 and C-3).
As an illustration consider a pyrotechnic part described in
example 12.

Example 12: A suspicious lot of explosive bolts is estimated 
to be 15 percent defective due to improper loading density as 
observed by neutron radiography. 

( 1 ) Calculate the probability of one defective unit appearing 
in a flight quantity of four. 

(2) Plot the resulting histogram. 

(3) What is the reliability after the first defect? 

Not many failure density data are available, but past experience
with pyrotechnic devices has shown that the binomial distribution
applies. From the given data, the per-unit number of
effectives $q$ is 0.85, the per-unit number of defectives $p$ is 0.15,
the sample size $n$ is 4, and the possible number of failures $N_f$
is 0, 1, 2, 3, or 4. The frequency functions corresponding to
these constants are given by



$p(N_f) = \dfrac{4!}{(4-N_f)!\,N_f!}\, p^{N_f} q^{4-N_f}$


and 


$p(N_f) = q^{4} + 4q^{3}p + 6q^{2}p^{2} + 4qp^{3} + p^{4}$

The probability of one defective unit appearing in a flight
quantity of four is given by the second term in the expansion;
hence,

$4q^{3}p = 4(0.85)^{3}(0.15) = 0.37$

The resulting histogram for this distribution is shown in figure
C-8. The probability that 2, 3, or 4 defects will occur, as the
reliability after the first defect, is the sum of the remaining terms
in the binomial expansion. This probability can be calculated
by using the equation for $R(N_f)$. However, it is simpler to use
the histogram graph and sum the probabilities over $N_f$ from
2 to 4; hence,

$R(2) = 0.096 + 0.011 + 0.0011 = 0.108$

These explosive bolts in their present form are not suitable for 
use on any spacecraft because the probability of zero defects 
is only 0.522, much below the usually desired 0.999 for pyro- 
technic spacecraft devices. 
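The binomial expansion of example 12 can be generated term by term. This is a minimal sketch assuming the values in the text ($n = 4$, $p = 0.15$, $q = 0.85$); `math.comb` supplies the same coefficients as the Pascal's triangle row in table C-8, and the function name is illustrative.

```python
import math


def binomial_terms(n, p):
    """Return [P(0 defects), P(1 defect), ..., P(n defects)]."""
    q = 1.0 - p
    return [math.comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]


coefficients = [math.comb(4, k) for k in range(5)]  # Pascal's triangle row for n = 4
terms = binomial_terms(4, 0.15)

p_zero = terms[0]        # no defective bolts in the flight quantity
p_one = terms[1]         # exactly one defective bolt
r_two = sum(terms[2:])   # two or more defective bolts

print(coefficients)
print(f"P(0) = {p_zero:.3f}, P(1) = {p_one:.3f}, R(2) = {r_two:.3f}")
```

Summing the computed terms may give a value for $R(2)$ that differs in the third decimal place from the figure read off the histogram.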

Determination of confidence limits. — When an estimate is
made from discrete distributions, it is expected that additional
estimates of the same parameter will be close to the original
estimate. It is desirable to be able to determine upper and lower
confidence limits at some stated confidence level for discrete
distribution estimates just as is done for continuous functions of
time. The analytical procedure for determining these intervals
is simplified by using specially prepared tables and graphs.
Useful tables for the binomial distribution are given in the
literature (ref. C-3).

Figure C-8. — Explosive bolts histogram. (Abscissa: number of failures, $N_f$.)

As an example of how confidence intervals can be obtained 
for Poisson estimates, consider problem 13. 


Problem 13: The Poisson estimate of reliability from the 5th
to the 10th failure for speed controllers was found to be 0.0013
in a previous problem. What are the upper and lower confidence
limits on this estimate at a 95-percent confidence level?

The variation in $\bar{t}$ can be found by using figure C-9. Enter
figure C-9 on the 5-percent $\alpha$ line at the left-hand end of the 5
interval. Here, $T/\bar{t}_1 = 10.5$; then $\bar{t}_1 = 8.57\times10^{4}/10.5
= 8160$ hr. Using the left-hand end of the 4 interval gives
$T/\bar{t}_2 = 9.25$; then $\bar{t}_2 = 8.57\times10^{4}/9.25 = 9530$ hr. One simple
method for finding $f(5)$ is to use figure C-10 (ref. C-5). The
$t/\bar{t}$ ratios of interest are 1.22, 1.16, and 1.05, respectively. For
these ratios with $N_f = 5$, the values of $f(5)$ from figure C-10
are 0.997, 0.9987, and 0.9992, respectively. Because the sum
of the last five terms is desired, $R(5)$ is 0.003, 0.0013, and
0.0008, respectively. This means that the probability of the
5th to the 10th failure of a speed control occurring is in the
interval 0.0008 to 0.003 at a confidence level of 95 percent.

As an illustration of how confidence intervals can be obtained 
for a binomial distribution, consider example 14. 

Example 14: The probability of one defective unit appearing 
in a flight quantity of four explosive bolts has been calculated 
to be 0.37. What are the upper and lower confidence limits on 
this estimate at a 90-percent confidence level? 



Figure C-9. — Poisson MTBF fixed test time.

Figure C-10. — Poisson unreliability sum. (Curve parameter: number of failures, $N_f$.)


If the sample size is $n$, the number of defectives is $r$, and the
confidence level is $\gamma$, this example has the following constraints:
$n = 4$, $r = 1$, and $\gamma = 90$ percent. Using these constraints,
the upper (U) and lower (L) confidence limits can be obtained
directly from existing tables as UCL = 0.680 and LCL = 0.026.
This means that with a 90-percent confidence level, the probability
of one defective bolt appearing in a flight quantity of
four is in the interval from 0.026 to 0.680.


Sampling 

Purpose of sampling . — Sampling is a statistical method 
used when it is not practical to study the whole population. 
There are usually five reasons why sampling is necessary. 

(1) Economy — It usually costs less money to study a sample
of an item than to study the whole population.


(2) Timeliness — A sample can be studied in less time than the 
whole population can be studied, giving prompt results. 

(3) Destructive nature of a test — Some tests require that the 
end item be consumed to demonstrate performance, leaving 
nothing to use afterwards. 

(4) Accuracy — A sample survey conducted by well-trained 
researchers usually will result in accurate and valid decisions. 

(5) Infinite population — In many analytical studies, an infi- 
nite population is available. If any information is to be used for 
decision making, it must be based on a sample. 

Choosing a sample. — Good judgment must be used in choosing
a sample. Subjective methods of choosing samples frequently
result in bias. Bias is an expression, either conscious or
subconscious, of the selector's preferences. Bias can be held to
a minimum by using a nonsubjective method developed just for
this purpose. Several nonsubjective sampling procedures are
described:
6433

3465 

9601 

2364 

7304 

2582 

7348 

9189 

3260 

9292 

0820 

5774 

0141 

1430 

4580 

1460 

3821 

1377 

9505 

8160 

6606 

6216 

3467 

3146 

7144 

7143 

2148 

7971 

4815 

8073 

9158 

1221 

0811 

9732 

8476 

5114 

5895 

8309 

3447 

1896 

9491 

7942 

0504 

7705 

6661 

8063 

9971 

4606 

4532 

1285 

3764 

0251 

2031 

6398 

0911 

5460 

3139 

0919 

1374 

3930 

6385 

4201 

7613 

1904 

0324 

9045 

0578 

1535 

7490 

8151 

7170 

2172 

1610 

3941 

3365 

5831 

6876 

7491 

0284 

6685 

4668 

4347 

3255 

5817 

0566 

9386 

4288 

4014 

1630 

5047 

3979 

1514 

3614 

4629 

8471 

1116 

9985 

5599 

6773 

6166 

5052 

9225 

3100 

4598 

5956 

5023 

3984 

7916 

0065 

7285 

3045 

4659 

9757 

4257 

0480 

3433 

4642 

8869 

6557 

1411 

6365 

7260 

5307 

4638 

7766 

7310 

1383 

2691 

8418 

3377 

5073 

7625 

0786 

7398 

5023 

5416 

7512 

2701 

9790 

0227 

2332 

8547 

0102 

5074 

8047 

0922 

7343 

5745 

8018 

1887 

9360 

8796 

7071 

7336 

1660 

1041 

9974 

8254 

4451 

0222 

2094 

1913 

6825 

5863 

2005 

4212 

8309 

3020 

6559 

0215 

2623 

4943 

9000 

5344 

2370 

2384 

9423 

4673 

0714 

2687 

6422 

9143 

6129 

1856 

3039 

5374 

4683 

0176 

0451 

7953 

0651 

4436 

3670 

7855 

1960 

8673 

8413 

4836 

5998 

6579 

7506 

5000 

4255 

5764 

3609 

1020 

8237 

6894 

9837 

1368 

8718 

6203 

8093 

6780 

9129 

9665 

6829 

9191 

7490 

7113 

1892 

5325 

5011 

5412 

3099 

8245 

5784 

0452 

4869 

1887 

7249 

8720 

6199 

6950 

0544 

6023 

5053 

0009 

4183 

6415 

4602 

6347 

8086 

8671 

9148 

4227 

1112 

5170 

4008 

4381 

7218 

6854 

4403 

2978 

1072 

5939 

5911 

4263 

4381 

2292 

4932 

1495 

4755 

2205 

4428 

5465 

4940 

5451 

9638 

4934 

6648 

4630 

8251 

6946 

8183 

6365 

4514 

2652 

7126 

7385 

4179 

0942 

6207 

9039 

3236 

9266 

7218 

4841 

9194 

7748 

9803 

7382 

3528 

6676 

4488 

5572 

2145 

7665 

4396 

1351 

6488 

9263 

0357 

5372 

6570 

6568 

7756 

3493 

9351 

2866 

9530 

6300 

0385 

8393 

7565 

8316 

6793 

4451 

6023 

7871 

7709 

7769 

4313 

2811 

9490 

9022 

3099 

3024 

1744 

9050 

8041 

3606 

8243 

2306 

4454 

5564 

2468 

4920 

7083 

3475 

6667 

2574 

3523 

4330 

8319 

5329 

5230 

9644 

7278 

2972 

8596 

4177 

8438 

5820 

7721 

8251 

0092 

4892 

6287 

3804 

0336 

4207 

2089 

7484 

9520 

8119 

7386 

5509 

0339 

6184 

1966 

9891 

2054 

8585 

9152 

9115 

1149 

9024 

0968 

1853 

4202 

3429 

1213 

3675 

8640 

7785 

7062 

5791 

2440 

3601 

5269 

4622 

2543 

4000 

5606 

5941 

8415 

7863 

5148 

7218 


Figure C-11. — Random digits table.


(1) Random sampling — Each item in the population has an
equal and independent chance of being selected as a sample. A
random-digits table (fig. C-11) has been developed to facilitate
drawing random samples and has been constructed to make the
10 digits from 0 to 9 equally likely to appear at any location in
the table. Adjacent columns of numbers can be combined to get
various-sized random numbers.

(2) Stratified sampling — Similar items in a population are 
grouped or stratified, and a random sample is selected from 
each group. 

(3) Cluster sampling — Items in a population are partitioned 
into clusters, and a random sample is selected from each cluster. 

(4) Double sampling — A random sample is selected; then,
depending on what is learned, some action is taken or a second
sample is drawn. After the second random sample is drawn,
action is taken on the basis of data obtained from the combination
of both samples.

(5) Sequential sampling — Random samples are selected and 
studied one at a time. A decision on whether to take action or 
to continue sampling is made after each observation on the 
basis of all data available at that selection. 

As an illustration of when to use various sampling methods 
consider example 15. 

Example 15: Describe how a sample should be selected for
three cases:

(1) Invoices numbered from 6721 to 8966 consecutively. In
this case, a random sampling procedure could be used based on
the four-digit table given in figure C-11. Using the given
invoice numbers, start at the top of the left column and proceed
down each column selecting random digits until the desired
sample size is obtained. Disregard numbers outside the range of
interest.

(2) Printed circuit assemblies to compare the effectiveness of 
different soldering methods. If boards are all the same type, a 
cluster sampling procedure could be used here. Group the 
boards by soldering methods; select x joints from each cluster 
to compare the effectiveness of different soldering methods. 

(3) Residual gases in a vacuum vessel to determine the partial 
pressure of gases at various tank locations. A stratified sam- 
pling procedure could be used in this case. Stratify the tank near 
existing feedthroughs into x sections; an appropriate mass run 
could be taken from each section at various ionizer distances 
from the tank walls. Analysis would tell how the partial pres- 
sures varied with ionizer depth at the feedthrough locations. 
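Case (1) of example 15 amounts to scanning the random digits in order and keeping only the values that fall in the invoice range. The sketch below assumes the first eight entries of figure C-11 as they appear in this text (the original column layout of the figure is not recoverable here); the function name and the sample size of two are illustrative.

```python
def select_from_random_digits(digits, low, high, sample_size):
    """Scan a random-digit list in order; keep in-range values, skip repeats."""
    chosen = []
    for value in digits:
        if low <= value <= high and value not in chosen:
            chosen.append(value)
        if len(chosen) == sample_size:
            break
    return chosen


# First entries of the random digits table, in the order listed in the text.
first_entries = [6433, 3465, 9601, 2364, 7304, 2582, 7348, 9189]

invoice_sample = select_from_random_digits(first_entries, 6721, 8966, sample_size=2)
print(invoice_sample)
```

Skipping repeated values is a design choice for sampling without replacement; the text does not say how repeats should be handled.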

Sample size. — A completely general equation for determining
sample size $n$ is given by

$F(t_t) = 1 - R(t_t) = \dfrac{N_f}{n}$

where

$N_f$  desired number of time-to-failure points
$n$  sample size
$t_t$  test truncation time

This equation can be used with any of the reliability functions
given in figure C-1.

As an illustration of how these equations can be applied to 
electrical parts, consider example 16, which is derived from 
example 1. 

Example 16: Tantalum capacitors with a failure rate of
$1\times10^{-3}$ failure/hr are to be tested to failure. In a 1000-hr test,
what sample size should be used to get 25 time-to-failure data
points?

Solution 16: The truncated exponential reliability function is
given by

$R(t_t) = e^{-\lambda t_t} = e^{-(1\times10^{-3})(1000)} = 0.37$

Solving the general sample size equation for $n$ and substituting
values gives

$n = \dfrac{N_f}{1 - R(t_t)} = \dfrac{25}{0.63} = 39.7$

Rounding off to the nearest whole unit gives $n = 40$ pieces. This
means that 40 capacitors tested for 1000 hr should give about 25
time-to-failure data points.
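The sample size calculation of example 16 can be sketched as follows, assuming an exponential reliability function as in the text. The function name is illustrative, and rounding up with `math.ceil` (rather than to the nearest unit) is an assumption made here so that the expected number of failures is not less than the number requested.

```python
import math


def exponential_sample_size(desired_failures, failure_rate_per_hr, test_time_hr):
    """Solve 1 - R(t_t) = N_f / n for n, with R(t) = exp(-lambda * t)."""
    reliability = math.exp(-failure_rate_per_hr * test_time_hr)
    return math.ceil(desired_failures / (1.0 - reliability))


n = exponential_sample_size(25, 1e-3, 1000.0)
expected_failures = n * (1.0 - math.exp(-1.0))

print(f"n = {n}, expected failures = {expected_failures:.1f}")
```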


Accelerated Life Testing 

Life testing to define the time duration during which a device 
performs satisfactorily is an important measurement in 
reliability testing because it is a measure of the reliability of a 
device. The life that a device will exhibit is very much depen- 
dent on the stresses it is subjected to. The same devices in field 
application are frequently subjected to different stresses at 
varying times. It should be recognized then that life testing 
involves the following environmental factors: 

(1) The use stresses may influence the life of the device and
failure rate functions.

(2) The field stresses could be multidimensional.

(3) An interdependence among the stress effects exists in the
multidimensional stress space.

(4) Life performance may vary because most devices operate 
over a range in a multidimensional stress space. 

Testing objects to failure under multidimensional stress 
conditions is usually not practical. Even if it were, if the system 
were properly designed, the waiting time to failure would be 
quite long and therefore unrealistic. It has been shown that 
time-to-failure data are important to reliability testing, and now 
they appear difficult to obtain. These are some of the reasons 
why many are turning to accelerated life testing, such as 
compressed-time testing, advanced-stress testing, or optimum 
life estimates: 

(1) Compressed-time testing— If a device is expected to 
operate once in a given time period on a repeated cycle, life 
testing of this device may be accelerated by reducing the 
operating time cycle. The multidimensional stress condition 
need not be changed. The stresses are being applied at a faster 
rate to accelerate device deterioration. Care should be taken not 
to accelerate the repetition rate beyond conditions that allow the 
device to operate in accordance with specifications. Such 
acceleration would move the device into a multidimensional 
stress region that does not exist in field conditions and would 
yield biased information. As an illustration of compressed time 
testing, consider example 17. 

Example 17: The stepping motor in example 3 was being 
pulsed for life testing. How could this life test be accelerated? 

The power supply providing the stepping pulses may have
been stepping at the rate of one pulse per 10 sec, resulting in a
test time of $10^{7}$ sec. These motors had a frequency response
allowing 10 pulses per sec. Increasing the pulse stepping rate up
to the frequency response limit yields comparable time-to-failure
data in $10^{5}$ sec, a savings in time of 2 orders of magnitude.
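
The arithmetic of example 17 can be checked with a short sketch. The function name and argument names are illustrative assumptions and do not come from the manual; the numbers are those of the example.

```python
def compressed_test_time(baseline_time_s, baseline_period_s, max_rate_hz):
    """Test time when the same number of cycles is applied at a faster rate.

    baseline_time_s   -- test duration at the original pulse rate, sec
    baseline_period_s -- seconds per pulse at the original rate
    max_rate_hz       -- highest pulse rate the device can follow, pulses/sec
    """
    cycles = baseline_time_s / baseline_period_s  # number of pulses applied
    return cycles / max_rate_hz                   # same pulses, faster rate


# Example 17: one pulse per 10 sec for 1e7 sec, motor can follow 10 pulses/sec.
new_time_s = compressed_test_time(1e7, 10.0, 10.0)
speedup = 1e7 / new_time_s
print(new_time_s, speedup)
```

The stress levels are unchanged in this calculation; only the repetition rate differs, which is why the rate must stay within the frequency response of the device.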

(2) Advanced-stress testing— If a device is expected to 
operate in a defined multidimensional stress region, life testing 
of this device may be accelerated by changing the multidimen- 
sional stress boundary. Usually the changes will be toward 
increased stresses because this tends to reduce time to failure. 
The two reasons why advanced stress testing is used are to save
time and to see how a device performs under these stress 
conditions. Care should be exercised in changing stress bound- 
aries to be sure that unrealistic conditions leading to wrong 
conclusions are not imposed on the device. A thorough study of 
the failure mechanisms should be made to ensure that proposed 
changes will not introduce new mechanisms that are not nor- 
mally encountered. If an item has a certain failure density 
distribution in the rated multidimensional stress region, chang- 
ing the stress boundaries should not change the failure density 
distribution. Some guidelines for planning advanced-stress 
tests are 


(a) Define the multidimensional stress region for an item; 
nominal values should be centrally located. 

(b) Study the failure mechanisms applicable to this item. 

(c) On the basis of guidelines (a) and (b), decide which stresses 
can be advanced without changing the failure mechanisms. 

(d) Specify multiple stress tests to establish trends; one point 
should be on the outer surface of the multidimensional region. 

(e) Be sure that the specimen size at each stress level is 
adequate to identify the failure density function and that it has 
not changed from level to level. 

(f) Pay attention to the types of failures that occur at various
stress levels to be sure that new failure mechanisms are not 
being introduced. 

(g) Decide whether new techniques being developed for 
advanced-stress testing apply to this item. Several popular 
techniques are described: 

(i) Sensitivity testing — Test an item at the boundary
stress for a given time. If failure occurs, reduce stress by a fixed
amount and retest for the same time. If no failure occurs,
increase stress by a fixed amount and retest for the same time.
Repeat this process until 25 failures occur. This technique is
used to define endurance limits for items.

(ii) Least-of-N testing — Cluster items in groups and subject
each cluster to a specified stress for a given time. Stop at the
first failure at each stress level. Examine failed items to ensure
conformance to expected failure mechanisms.

(iii) Progressive-stress testing— Test an item by starting 
at the central region in stress space and linearly accelerating 
stress with time until failure occurs. Observe both the failure 
stress level and the rate of increasing stress. Vary the rate of 
increasing stress and observe its effect on the failure stress 
magnitude. Examine failed items to ensure conformance to 
expected failure mechanisms.
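
The sensitivity (up-and-down) procedure in item (i) can be sketched as a loop. This is a minimal illustration, not the manual's procedure in full: `fails_at` is a hypothetical stand-in for running one fixed-duration test at a given stress, and the `max_trials` guard is an added assumption so the loop always terminates.

```python
def sensitivity_test(fails_at, start_stress, step, target_failures, max_trials=1000):
    """Up-and-down stress test: step down after a failure, up after a survival.

    fails_at        -- callable(stress) -> True if the item fails (hypothetical)
    start_stress    -- boundary stress used for the first trial
    step            -- fixed stress increment
    target_failures -- stop after this many failures (the text uses 25)
    Returns the list of (stress, failed) trials in order.
    """
    stress = start_stress
    failures = 0
    history = []
    while failures < target_failures and len(history) < max_trials:
        failed = fails_at(stress)
        history.append((stress, failed))
        if failed:
            failures += 1
            stress -= step   # failure: reduce stress by a fixed amount
        else:
            stress += step   # survival: increase stress by a fixed amount
    return history


# Deterministic toy item that fails at any stress of 10 or more.
trials = sensitivity_test(lambda s: s >= 10, start_stress=12, step=1, target_failures=5)
print([s for s, _ in trials])
```

With a real item the outcome at each level is random, and the levels visited cluster around the endurance limit.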

As an illustration of advanced-stress testing, consider 
example 18. 

Example 18: A power-conditioning supply was being life
tested at nominal conditions with an associated electric rocket.
The nominal electrical, thermal, vibration, shock, and vacuum
stresses resulted in fairly long waiting periods to failure.
Changing the multidimensional stress conditions by a factor of
1.25 to 2, which is usually done during development testing,
tended to identify design deficiencies with shorter waiting
periods without affecting the failure mechanism.

(3) Optimum life estimate — One remaining calculation for
nonreplacement failure or time-truncated life test is the optimum
estimate of mean time between failures $\bar{t}$. It has been
shown (ref. C-1) that $\bar{t}$ given by the time sum divided by the
number of failures should be modified by a censorship factor
and a truncation time factor. The censorship factor K is caused
by wearout failures, operator error, manufacturing errors, and
so forth. The correction equation for $\bar{t}$ is given by (ref. C-1)

$$\bar{t} = \frac{\sum_{i=1}^{N_f} t_i + (n - N_f)\,t_t}{N_f - K}$$

where

$t_i$    time to failure of the ith failed item
$n$      number of items on test
$t_t$    truncation time
$N_f$    number of failures
$K$      censorship factor

As an illustration consider example 19. 

Example 19: The tantalum capacitor tested in example 1
could have been stopped when 10 capacitors (580 part-hours)
out of 100 had failed at a testing time of 100 hr. What is an
optimistic value for $\bar{t}$?

Solution 19: Inspection of the 10 failed capacitors showed
that two units failed because of manufacturing errors. Therefore,
$N_f = 10$, $K = 2$, $n = 100$ capacitors, $t_t = 100$ hr, and the sum
of $t_i = 580$ hr. Substituting these values into the $\bar{t}$ correction
equation gives

$$\bar{t} = \frac{580 + (100 - 10)100}{10 - 2} = \frac{9580}{8} \approx 1198\ \text{hr}$$



This is an optimistic estimate for the mean time between 
failures, but it certainly is fair and reasonable to make these 
types of corrections. 
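
The corrected estimate can be reproduced with a few lines of code. This is a sketch under the assumption that the correction equation has the form used in solution 19 (time sum plus survivor time, divided by failures less the censorship factor); the function and argument names are illustrative.

```python
def optimum_mtbf(failure_time_sum, n_units, n_failures, truncation_time, censored=0):
    """Mean time between failures for a time-truncated, nonreplacement test.

    failure_time_sum -- sum of the times to failure of the failed items, hr
    n_units          -- number of items placed on test
    n_failures       -- number of failures observed
    truncation_time  -- time at which the test was stopped, hr
    censored         -- censorship factor K (failures excluded from the count)
    """
    survivor_time = (n_units - n_failures) * truncation_time
    return (failure_time_sum + survivor_time) / (n_failures - censored)


# Example 19: 10 failures out of 100 in 100 hr, two of them manufacturing errors.
uncorrected = optimum_mtbf(580, 100, 10, 100)
corrected = optimum_mtbf(580, 100, 10, 100, censored=2)
print(uncorrected, corrected)
```

Removing the two censored failures from the denominator raises the estimate from 958 hr to about 1198 hr, which is why the text calls it optimistic.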


Accept/Reject Decisions With Sequential Testing 

A critical milestone occurs in product manufacturing at 
delivery time. An ethical producer is concerned about shipping 
a product lot that does not meet specifications. The consumer 
is concerned about spending money to purchase a product that 
does not meet specifications. A test method that permits each to 
have an opportunity to obtain data for decisionmaking is 
required. 
Sequential testing constraints. — If $\alpha$ is the producer's risk
and $\beta$ is the consumer's risk, two delivery time constants valid
for small risks have been defined and are given as

$$A = \frac{1-\beta}{\alpha}$$

$$B = \frac{\beta}{1-\alpha}$$

Let $P_1$ be the probability that $N_f$ failures will occur in time t
for a specified minimum acceptable $\bar{t}_1$, and let $P_0$ be the
probability that $N_f$ failures will occur in time t for an arbitrarily
chosen upper value $\bar{t}_0$. Test rules using these four constants
have been defined for each condition (refs. C-1 and C-5):

(1) Accept if $P_1/P_0 \le B$.

(2) Reject if $P_1/P_0 \ge A$.

(3) Continue testing if $B < P_1/P_0 < A$.

Exponential parameter decisionmaking. — As an illustration
of how these testing constraints can be implemented for the
exponential distribution, consider example 20.

Example 20: A purchased quantity of 100 000 tantalum
capacitors has been received. Negotiations prior to placement
of the order had established that $\alpha = \beta = 0.1$, $\bar{t}_1 = 1000$ hr, and
$\bar{t}_0 = 2000$ hr and that the sequential reliability test should be
truncated in 48 hr.

(1) Calculate A and B.

(2) Write the expressions for $P_0$ and $P_1$.

(3) How many units should be placed on test?

(4) Plot a sequential reliability control graph to facilitate
decisionmaking at each failure time.

Solution 20:

(1) The delivery time constants are obtained by substituting
values into the defining equations.

$$A = \frac{1 - 0.1}{0.1} = 9$$

$$B = \frac{0.1}{1 - 0.1} = 0.111$$

(2) Using binomial distribution from figure C-1 and substituting
values gives $P_0(N_f)$ and $P_1(N_f)$ as

$$P_0(N_f) = \frac{(t/2000)^{N_f}\, e^{-t/2000}}{N_f!}$$

$$P_1(N_f) = \frac{(t/1000)^{N_f}\, e^{-t/1000}}{N_f!}$$

(3) Delivery constant B defines the acceptance criteria for
$P_1/P_0$. Using this constraint and substituting for $P_1$ and $P_0$ gives

$$B = \frac{P_1(N_f)}{P_0(N_f)} = 2^{N_f}\, e^{-t/2000}$$

The minimum testing time without failure $t(0)_{\min}$ is given by

$$0.111 = (2)^{0}\, e^{-t(0)_{\min}/2000}$$

Solving for $t(0)_{\min}$ gives

$$t(0)_{\min} = 2.20 \times 2000 = 4400\ \text{unit-hr}$$

The minimum number of capacitors to be life tested for 48 hr
is given by

$$n_{\min} = \frac{4400\ \text{unit-hr}}{48\ \text{hr}} = 91.7$$

To ensure good results, choose a sample size n that is more than
twice $n_{\min}$; for this problem, use n = 200 units. The required
minimum testing time for 200 units is given by

$$t(0)_{\min} = \frac{4400\ \text{unit-hr}}{200\ \text{units}} = 22.0\ \text{hr}$$

The test can be stopped and an accept/reject decision made at
$t_t$, where $t_t$ is given by

$$t_t = 48\ \text{hr} \times 200\ \text{units} = 9.6 \times 10^{3}\ \text{unit-hr}$$

(4) The tantalum capacitor reliability chart is constructed by
using five points in the ($N_f$, t) plane; three of these points have
already been calculated and are given by

$$t(0)_{\min} = 4400,\quad N_f = 0$$

$$t_t = 9.6 \times 10^{3},\quad N_f = 0$$

$$t = 0,\quad N_f = 0$$

The remaining two points are calculated by using the test
inequality given by
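$$B < \frac{P_1(N_f)}{P_0(N_f)} < A$$

In general terms the ratio $\rho(N_f)$ is given by

$$\rho(N_f) = \left(\frac{\bar{t}_0}{\bar{t}_1}\right)^{N_f} e^{-\left(\frac{1}{\bar{t}_1} - \frac{1}{\bar{t}_0}\right)t}$$

Taking natural logarithms of the inequality and substituting gives

$$\ln B < N_f \ln\left(\frac{\bar{t}_0}{\bar{t}_1}\right) - \left(\frac{1}{\bar{t}_1} - \frac{1}{\bar{t}_0}\right)t < \ln A$$

Adding $(1/\bar{t}_1 - 1/\bar{t}_0)t$ to each term gives

$$\ln B + \left(\frac{1}{\bar{t}_1} - \frac{1}{\bar{t}_0}\right)t < N_f \ln\left(\frac{\bar{t}_0}{\bar{t}_1}\right) < \ln A + \left(\frac{1}{\bar{t}_1} - \frac{1}{\bar{t}_0}\right)t$$

Dividing all terms by $\ln(\bar{t}_0/\bar{t}_1)$ gives

$$\frac{\ln B}{\ln(\bar{t}_0/\bar{t}_1)} + \frac{\left(\frac{1}{\bar{t}_1} - \frac{1}{\bar{t}_0}\right)t}{\ln(\bar{t}_0/\bar{t}_1)} < N_f < \frac{\ln A}{\ln(\bar{t}_0/\bar{t}_1)} + \frac{\left(\frac{1}{\bar{t}_1} - \frac{1}{\bar{t}_0}\right)t}{\ln(\bar{t}_0/\bar{t}_1)}$$

The inequality is now in the form given by

$$a + bt < N_f < c + bt$$

The constants a and c for this problem for zero failures are given
by

$$a = \frac{\ln B}{\ln(\bar{t}_0/\bar{t}_1)} = \frac{-2.2}{0.693} = -3.18$$

$$c = \frac{\ln A}{\ln(\bar{t}_0/\bar{t}_1)} = \frac{2.2}{0.693} = 3.18$$

Because these boundary constraints are straight lines in the form

$$N_f = bt + (a\ \text{or}\ c)$$

the slope b is given by

$$b = \frac{\frac{1}{\bar{t}_1} - \frac{1}{\bar{t}_0}}{\ln(\bar{t}_0/\bar{t}_1)} = 7.22 \times 10^{-4}$$

Figure C-12 shows the resulting tantalum capacitor reliability
chart. The tantalum capacitor acceptance reliability test results
in an “accept,” “continue to test,” or “reject” decision depending
on the failure performance of the capacitors as a function of
operating time in unit-hours as zoned in figure C-12.

Binomial parameters decisionmaking. — For the binomial
frequency function, the procedure to set up a sequential reliability
test is similar to the Poisson methodology. Because the
unreliability, or number of defectives, is given by 1 - R for an
effectiveness of R, then $P_1(N_f)$ is given in binomial form by

$$P_1(N_f) = \binom{n}{N_f} (1 - R_1)^{N_f} R_1^{N_s}$$

where

$n$            $N_s + N_f$
$N_s$          number of successful trials
$N_f$          number of failed trials
$R_0$, $R_1$   chosen reliability values at some time t, $R_0 > R_1$

The ratio $P_1(N_f)/P_0(N_f)$ is given by

$$\frac{P_1(N_f)}{P_0(N_f)} = \frac{(1 - R_1)^{N_f} R_1^{N_s}}{(1 - R_0)^{N_f} R_0^{N_s}}$$

Following the steps given in example 20, give four of the points
in the ($N_f$, t) plane:

$$n_{\min} = \frac{\ln B}{\ln(R_1/R_0)},\quad N_f = 0$$

The test can be stopped and an accept/reject decision made at
the number of test truncation trials $N_t$; $N_t$ is given by

$$N_t = t_t N_c,\quad N_f = 0$$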
Figure C-12.—Tantalum capacitor reliability chart.

where $N_c$ is the number of units chosen for testing:

$$n = 0,\quad N_f = 0$$

$$a = \frac{\ln B}{\ln\left[\dfrac{R_0(1 - R_1)}{R_1(1 - R_0)}\right]}$$

$$c = \frac{\ln A}{\ln\left[\dfrac{R_0(1 - R_1)}{R_1(1 - R_0)}\right]}$$

The slope b is given by

$$b = \frac{\ln(R_0/R_1)}{\ln\left[\dfrac{R_0(1 - R_1)}{R_1(1 - R_0)}\right]}$$

The inequality equation for these conditions is given by

$$a + bn < N_f < c + bn$$
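
The exponential accept/reject boundaries of example 20 can be sketched in code. This is an illustrative sketch, not the manual's own procedure: the function names are assumptions, and the decision function simply restates the three test rules in terms of the straight-line boundaries a + bt and c + bt.

```python
import math


def delivery_constants(alpha, beta):
    """Return (A, B) for producer's risk alpha and consumer's risk beta."""
    return (1 - beta) / alpha, beta / (1 - alpha)


def exponential_boundaries(alpha, beta, t1, t0):
    """Return (a, b, c) for the lines N_f = a + b*t (accept) and N_f = c + b*t (reject).

    t1 -- minimum acceptable mean time between failures, hr
    t0 -- upper mean time between failures, hr (t0 > t1)
    """
    A, B = delivery_constants(alpha, beta)
    log_ratio = math.log(t0 / t1)
    a = math.log(B) / log_ratio
    c = math.log(A) / log_ratio
    b = (1 / t1 - 1 / t0) / log_ratio
    return a, b, c


def decide(n_failures, unit_hours, a, b, c):
    """Apply the test rules at a given accumulated test time in unit-hours."""
    if n_failures <= a + b * unit_hours:
        return "accept"
    if n_failures >= c + b * unit_hours:
        return "reject"
    return "continue"


# Example 20 values: alpha = beta = 0.1, t1 = 1000 hr, t0 = 2000 hr.
A, B = delivery_constants(0.1, 0.1)
a, b, c = exponential_boundaries(0.1, 0.1, 1000.0, 2000.0)
t_min = -a / b  # unit-hours needed to accept with zero failures
print(round(A, 3), round(B, 3), round(a, 2), round(c, 2), b, round(t_min))
```

Carrying full precision gives a minimum zero-failure test time of about 4394 unit-hr; the 4400 unit-hr in the text comes from rounding ln B to -2.20.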

Accept/reject charts at delivery milestones when based on
reliability sequential testing methods provide a rigorous
mathematical method for deciding whether to accept or reject
an order of components. The actual reliability value for these
components is not known and neither is it wise to consider
reliability assessment at this critical milestone.

TABLE C-9.—POWER SUPPLY PROBLEM DATA

Sample   Number     Reason for failure                      Repair
serial   of                                                 time,
number   failures                                           hr

1        1          A1A2-VR3 zener shorted                  1.2
         1          Ground wire broke                       1.4
         2          A1A2-VR3 zener shorted;                 5.5, 7.3
                    A1A2-Q2 transistor shorted
         0          In a 250-hr test no failure occurred    —
2        0          In a 250-hr test no failure occurred    —
         1          A3A1-C3 capacitor leaked                9.5
3        1          A3A1-C3 capacitor leaked                9.0
         0          In a 250-hr test no failure occurred    —
4        1          A7A1-VR1 unsoldered joint               .5
         1          A3A1-C3 capacitor leaked                9.5
5        0          In a 250-hr test no failure occurred    —
         0          In a 250-hr test no failure occurred    —

Subsample f chart. — The chief advantages of a subsample f
chart are that (1) it reduces reliability acceptance testing costs,
(2) it provides for product improvements, (3) it determines if
statistical control exists, and (4) it determines the mean time to
repair.

Example 21: A power supply has the following data:

(1) Acceptable reliability level $r_1$, 0.001 failure/hr; producer's
reliability risk $R_\alpha$, 10 percent; specified mean time to repair
$\phi$, 3.0 hr

(2) Lot tolerance fractional reliability deviation $r_2$, 0.005
failure/hr; consumer's reliability risk $R_\beta$, 10 percent

The product test data are given in table C-9. Use figure C-13
to analyze these data; then answer the following questions:

(1) What is a suitable time sample and rejection number for
meeting the 80-percent confidence level selected by management?

(2) What are the subsample sizes and rejection numbers?

(3) What are the confidence levels for the various rejection
numbers?

(4) What are the control limits on the mean time to repair?

(5) Plot these data on a subsample f chart.

(6) What should be done with the manufactured units?

Solution 21: Given the product data, follow these steps:

(1) Calculate the confidence level $\gamma$, the ratio of acceptable
reliability level to lot tolerance fractional reliability deviation
k, and the mean time between failures m:

$$\gamma = 1 - (R_\alpha + R_\beta) = 1 - (0.1 + 0.1) = 0.80,\ \text{or 80 percent}$$

$$k = \frac{r_2}{r_1} = \frac{0.005}{0.001} = 5$$

$$m = \frac{1}{r_1} = \frac{1}{0.001} = 1000\ \text{hr}$$

Looking up $Z_\alpha$ in a normal curve area table (table 3 in ref. C-3)
for $R_\alpha = 0.1$ shows that $Z_\alpha = -1.28$. The value of $K^2$ when
k = 5 and $\gamma = 0.80$ is obtained from figure 11-1 in reference C-3,
where $K^2 = 1.05$. The equation for t is thus $t = mK^2$ = (1000)(1.05)
= 1050 hr ≈ 1000 hr. The rejection number R for a time
sample of 1000 hr and a confidence level $\gamma = 0.80$ is given by

$$R_{1000}(0.80) = K^2 + Z_\alpha K + 0.5 = 1.05 + (1.28)1.025 + 0.5 = 2.86 \approx 3$$

(2) Recalculate the subsample for $\gamma = 0.50$ and k = 5: From
figure 11-1 in reference C-3, $K^2 = 0.29$. Therefore,

$$t = mK^2 = (1000)(0.29) = 290\ \text{hr} \approx 250\ \text{hr}$$

Looking up $Z_\alpha$ in table 3 in reference C-3 for

$$R_\alpha = \frac{1 - 0.50}{2} = 0.25$$

shows that $Z_\alpha = -0.68$. Recalculate the rejection number as

$$R_{250}(0.50) = K^2 + Z_\alpha K + 0.5 = 0.29 + (0.68)0.54 + 0.5 = 1.16 \approx 1\ \text{failure}$$

(3) Calculate $K^2$ for each value of t shown in table C-10 as

$$K^2 = \frac{t}{m} = \frac{250}{1000} = 0.25 \quad \text{for } k = 5;\ m = 1000\ \text{hr}$$

Look up in figure 11-1 of reference C-3 the confidence level
$\gamma$ values shown in table C-10. Calculate $R_\alpha$ for each confidence
level. (The calculated values are shown in table C-10.)

$$R_\alpha = \frac{1 - \gamma}{2} = \frac{1 - 0.46}{2} = 0.27$$

Look up $Z_\alpha$ for each confidence level in table 3 of reference
C-3 (the values are tabulated in table C-10). Recalculate the
rejection numbers $R_t(\gamma)$ for each subsample (the values are
listed in table C-10):

$$R_t(\gamma) = K^2 + Z_\alpha K + 0.5$$

$$R_{250}(0.46) = 0.25 + (0.61)0.5 + 0.5 = 1.05 \approx 1$$

$$R_{500}(0.63) = 0.50 + (0.89)0.71 + 0.5 = 1.63 \approx 2$$

$$R_{750}(0.73) = 0.75 + (1.11)0.87 + 0.5 = 2.21 \approx 2$$

$$R_{1000}(0.78) = 1.00 + (1.22)1 + 0.5 = 2.72 \approx 3$$

TABLE C-10.—SUBSAMPLE DATA

t, hr    K^2     gamma    R_alpha    Z_alpha    R_t(gamma)

250      0.25    0.46     0.27       0.61       1
500      .50     .63      .185       .89        2
750      .75     .73      .133       1.11       2
1000     1.0     .78      .11        1.22       3

(4) Find the control limits on the mean time to repair for the
data given in table C-9:

$$\mathrm{UCL}_\phi = \frac{2\bar{f}\phi}{\chi^2_{8}(0.90)} = \frac{2 \times 4 \times 3}{3.49} = 6.88\ \text{hr}$$

$$\mathrm{LCL}_\phi = \frac{2\bar{f}\phi}{\chi^2_{8}(0.10)} = \frac{2 \times 4 \times 3}{13.4} = 1.79\ \text{hr}$$

where $\bar{f}$ is the average number of failures and $\phi$ denotes mean
time to repair. These control limits are shown in figure C-13 for
the repair time process. The lower control limit in this case has
no importance other than statistical completeness because any
value less than 1.79 hr is an indication of a better maintenance
activity than what has been specified, which is a desirable condition.

Figure C-13.—Completed subsample f chart for problem 22.

The completed subsample f chart is shown in figure C-13.
Table C-11 shows the tabulated data calculated to solve this
problem. During the various subsample intervals, some useful
conclusions can be drawn:

(1) During subsample interval 1 to 4 failures, reject serial
number 1, request an engineering investigation, and repair and
retest serial number 1 later.

(2) During subsample interval 5 to 8 failures

TABLE C-11.—POWER SUPPLY ANALYZED DATA
(Sample size, 250 hr.)

Time     Sample   Subsample   Reason for failure             Number of   Repair     Mean time
sample   serial   number                                     failures    time,      to repair,
         number                                                          hr         hr

1        1        1           A1A2-VR3 zener shorted         1           1.2
                  2           Ground wire broke              1           1.4
                  3           A1A2-VR3 zener shorted;        2           5.5, 7.3   5.1
                              A1A2-Q2 transistor shorted
                  4           No failures occurred           0
2        2        5           No failures occurred           0
                  6           A3A1-C3 capacitor leaked       1           9.5
         3        7           A3A1-C3 capacitor leaked       1           9.0        4.6
                  8           No failures occurred           0
3        4        9           A7A1-VR1 unsoldered joint      1           0.5
                  10          A3A1-C3 capacitor leaked       1           9.5
         5        11          No failures occurred           0
                  12          No failures occurred           0

Totals                                                       8           48.9       —
the cause identified, and appropriate corrective action worked
out and approved by an engineering review board.
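
The rejection-number and control-limit arithmetic of example 21 can be reproduced with a short sketch. The function names are illustrative assumptions; the $K^2$, $Z_\alpha$, and chi-square values are the ones quoted in the example rather than values computed here.

```python
import math


def rejection_number(k_squared, z_alpha):
    """R = K^2 + Z_alpha*K + 0.5, with z_alpha entered as a magnitude."""
    return k_squared + z_alpha * math.sqrt(k_squared) + 0.5


def mttr_control_limits(mean_failures, mttr_spec, chi2_small, chi2_large):
    """Return (LCL, UCL) on mean time to repair.

    chi2_small, chi2_large -- the two tabulated chi-square values for
    2*mean_failures degrees of freedom (3.49 and 13.4 in the example).
    """
    numerator = 2 * mean_failures * mttr_spec
    return numerator / chi2_large, numerator / chi2_small


# 1000-hr time sample at 80-percent confidence.
r_1000 = rejection_number(1.05, 1.28)

# Subsample rows (K^2, Z_alpha) for t = 250, 500, 750, and 1000 hr.
rows = [(0.25, 0.61), (0.50, 0.89), (0.75, 1.11), (1.00, 1.22)]
rounded = [round(rejection_number(k2, z)) for k2, z in rows]

lcl, ucl = mttr_control_limits(4, 3.0, 3.49, 13.4)
print(round(r_1000, 2), rounded, round(lcl, 2), round(ucl, 2))
```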
Reliability Training Answers

Chapter   Answers

1         1 (B), 2 (D), 3 (C), 4 (C)

2         1a (C), 1b (B), 2a (C), 2b (B), 3a (C), 3b (A), 4a (B), 4b (C), 5ai (B), 5aii (C),
          5aiii (B), 5b (C), 6a (C), 6b (B), 7a (B), 7b (C), 8a (C), 8b (C), 9 (D), 10 (A),
          11 (B), 12 (C), 13 (C), 14 (C), 15 (D), 16 (E), 17 (D), 18 (F)

3         1a (B), 1b (B), 1c (C), 2a (A), 2b (C), 2c (A), 3a (B), 3b (A), 3c (B), 4 (C), 5a ( ), 5b
          (B), 6 (C), 7a (A), 7b (B), 7c (B), 7d (C), 7e (A), 8 (B), 9a (B), 9b (C), 10a (C), 10b (C),
          10c (A)

4         1a (B), 1b (B), 2a (A), 2b (A), 3 (C), 4a (B), 4b (B)

5         1 (C), 2 (B), 3a (C), 3b (A), 3c (C), 4a (C), 4b (B), 4c (A), 5a (C), 5b (A), 6a (C), 6b
          (C), 6c (A), 7a (B), 7b (C), 7c (C), 7d (C), 8a (A), 8b (C), 8c (B), 8d (C), 8e (B), 8f (B)

6         1a (B), 1b (C), 1c (A), 2a (C), 2b (B), 2c (A), 2d (C), 3a (B), 3b (C), 3c. (B), 3c, (A)

7         1 (C), 2 (B), 3 (D), 4 (A), 5 (B), 6 (C), 7 (B), 8 (C)

8         1. Item 4, squawk, major, wrong, reliability, subsystem

9         1 (B), 2 (A), 3 (C), 4a (C), 4b (B), 4c (F), 5 (A), 6a (C), 6b (B), 7 (A), 8a (B), 8b (A)

10        1 (D), 2 (D), 3 (G), 4 (B), 5 (A), 6 (E), 7 (B), 8 (D), 9 (A), 10 (C), 11 (B), 12 (F), 13
          (E), 14a (C), 14b (C), 15 (C), 16 (B), 17 (E), 18 (A), 19a (C), 19b (B), 19c (A)

11        1a (C), 1b (B), 1c (C), 1d (C), 2a (C), 2b (A), 2c (B), 2d (C)
Training Manual for Elements of Interface
Definition and Control


As part of this reliability and maintainability training manual, the authors have included in this appendix 
should note that the original page numbers and content have been retained. 
Preface

This technical manual was developed under the Office of Safety and Mission Assurance continuous
training initiative. The structured information contained in this manual will enable the reader to efficiently and
effectively identify and control the technical detail needed to ensure that flight system elements mate properly
during assembly operations (both on the ground and in space).

Techniques used throughout the Federal Government to define and control technical interfaces for both
hardware and software were investigated. The proportion of technical information actually needed to
effectively define and control the essential dimensions and tolerances of system interfaces rarely exceeded 50
percent of any interface control document. Also, the current Government process for interface control is
paper intensive. Streamlining this process can improve communication, provide significant cost savings, and
improve overall mission safety and assurance.

The primary thrust of this manual is to ensure that the format, information, and control of interfaces
between equipment are clear and understandable, containing only the information needed to guarantee
interface compatibility. The emphasis is on controlling the engineering design of the interface and not on the
requirements of the system or the workings of the interfacing equipment.
Interface control should take place, with rare exception, at the interfacing elements and no further.

The manual has two sections. The first, Principles of Interface Control, covers how
interfaces are defined. It describes the types of interface to be considered and recommends a format for the
documentation necessary for adequate interface control. The second, The Process: Through the Design Phases,
provides tailored guidance for interface definition and control.

This manual can be used to improve planned or existing interface control
and development. It can also be used to refresh and update the corporate knowledge base. The information
presented herein will reduce the amount of paper and data required in interface definition and control processes
by as much as 50 percent and will shorten the time required to prepare an interface control document. It
highlights the essential technical parameters that ensure that flight subsystems will indeed fit together and
function as intended after assembly and checkout.
Chapter 1

Introduction 


This technical manual resulted from an investigation of 
techniques used throughout NASA and other Federal Govern- 
ment agencies to define and control technical interfaces for 
both hardware and software. The processes described herein 
distill the requirements for interface definition and control into 
a concise set of parameters that control the design of only the 
interface-related elements rather than providing extraneous 
design detail that must subsequently be configuration 
managed. 

The purpose of this manual is to provide guidelines for
establishing and conducting the interface control process so 
that items produced by different design agencies satisfactorily 
mate and operate in a way that meets mission requirements. 
These guidelines were drawn from the methodologies of a 
number of highly successful programs and therefore represent 
a compilation of “lessons learned.”

The principles and processes of interface definition and 
control presented in this document apply to all projects and 
programs but may be tailored for program complexity. For 
example, the interface control process may be less formal for a 
project or program that requires only one or two end items and 
has few participants; however, the formal interface control 
document is still necessary. For a project or program that 
requires a number of end items and where several participants 
are involved, a carefully followed interface control process is 
imperative, with comments, decisions, agreements, and com- 
mitments fully documented and tracked. Individual managers 
should provide the implementation criteria for their interface 
control processes early in the project or program (ref. 1).

This manual covers the basic principles of interface defini- 
tion and control; how to begin an interface control program 
during the development of a new project or program, how to 
develop and produce interface documentation, how to manage 
the interface control process, and how to transfer interface 
control requirements to hardware and software design. 

Interface definition and control is an integral part of system 
engineering. It should enter the system engineering cycle at the 
end of the concept development phase. Depending on whether 
the system under development is designed for one-time or 
continuous use, the process may continue for the full life cycle 
of the system. Interface definition and control should not be 
equated to configuration management or configuration control. 
Rather it is a technical management tool that ensures that all 
equipment will mate properly the first time and will continue to 
operate together as changes are made during the life cycle of the 
system. Figure 1.1 depicts the elements of the system engineering
cycle and is used in chapter 3 to describe the application of
the interface discipline at different parts of the life cycle (ref. 2). 


Establishing a system that ensures that all interface param- 
eters are identified and controlled from the initial design 
activities of a program is essential. It is not necessary that the 
fine details of these parameters be known at that time, but it is 
very important that the parameters themselves are identified, 
that everything known about them at that time is recorded and 
controlled, and that voids¹ are identified and scheduled for
elimination. The latter requirement is of primary importance to 
the proper design of any interface. Initial bounding of a void and 
scheduled tightening of those bounds until the precise dimen- 
sions or conditions are identified act as a catalyst to efficient 
design and development. An enforced schedule for eliminating 
voids is one of the strongest controls on schedule that can be 
applied (ref. 3). 
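
One way to picture void control is as a small record per void that carries its current bounds and its scheduled elimination date. The sketch below is a hypothetical data structure, not something the manual prescribes; the class, field, and function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Void:
    """A missing piece of interface data, bounded until it is resolved."""
    feature: str
    lower_bound: float
    upper_bound: float
    units: str
    due: date                                # scheduled elimination date
    resolved_value: Optional[float] = None   # set once the value is known

    def is_open(self) -> bool:
        return self.resolved_value is None

    def tighten(self, lower: float, upper: float) -> None:
        """Narrow the bounds; they may never widen or cross."""
        if lower < self.lower_bound or upper > self.upper_bound or lower > upper:
            raise ValueError("new bounds must lie within the current bounds")
        self.lower_bound, self.upper_bound = lower, upper


def overdue(voids: List[Void], today: date) -> List[Void]:
    """Open voids whose scheduled elimination date has passed."""
    return [v for v in voids if v.is_open() and v.due < today]


# Hypothetical voids for illustration only.
voids = [
    Void("connector mass", 0.5, 2.0, "kg", date(2024, 3, 1)),
    Void("bus voltage ripple", 0.0, 1.0, "V", date(2024, 9, 1)),
]
voids[0].tighten(0.8, 1.2)
late = overdue(voids, date(2024, 6, 1))
print([v.feature for v in late])
```

Reviewing the overdue list at each milestone is one way to enforce the elimination schedule described above.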

The process of identifying, categorizing, defining, and docu- 
menting interfaces is discussed in the following chapter. Guid- 
ance for the analysis of interface compatibility is also provided. 


[Figure 1.1: cycle diagram linking mission needs definition, risk and
systems analysis, concept definition, requirements definition, systems
integration, configuration management, technical oversight, and
verification and validation.]

Figure 1.1.—System engineering cycle. (The
requirements definition phase must include
the requirements for the interfaces as well as
those which will eventually be reflected in the
interface control document.)


1 A “void” is a specific lack of information needed for control of an interface 
feature. Control and elimination of voids is fundamental to a strong interface 
definition and control program. 
1.1 Training²


1. The processes explained in this manual for interface
definition and control are

A. A concise set of parameters that control the design of the
interface-related elements

B. A set of design details needed for configuration management

2. The process is very important for projects that require

A. A number of end items

B. Involvement of several participants

C. Comments, decisions, agreements, and commitments
that must be fully documented and tracked

D. All of the above

3. What elements does the system engineering cycle contain?

A. Mission needs, requirements, and integration

B. Technical oversight, core design, and system configuration

C. Mission needs definition, risk and systems analysis,
concept and requirements definitions, system integration,
configuration management, technical oversight,
and verification and validation

4a. What is a void?

A. Bracketed data

B. Wrong data

C. Lack of information needed

4b. How should voids be handled?

A. Voids should be identified and their elimination
scheduled.

B. Data should be analyzed.

C. Supplier should be guided.

4c. Name a strong control needed for voids.

A. Precise dimensions

B. Enforced schedule

C. Identified catalysts


2 Answers are given at the end of this manual.
Chapter 2

Principles of Interface Control 


2.1 Purpose of Interface Control 

An interface is that design feature of a piece of equipment 
that affects the design feature of another piece of equipment. 
The purpose of interface control is to define interface require- 
ments so as to ensure compatibility between interrelated pieces 
of equipment and to provide an authoritative means of control- 
ling the design of interfaces. Interface design is controlled by an 
interface control document (ICD). 

These documents 

1. Control the interface design of the equipment to prevent 
any changes to characteristics that would affect compat- 
ibility with other equipment 

2. Define and illustrate physical and functional characteris- 
tics of a piece of equipment in sufficient detail to ensure 
compatibility of the interface, so that this compatibility 
can be determined from the information in the ICD alone 

3. Identify missing interface data and control the submis- 
sion of these data 

4. Communicate coordinated design decisions and design 
changes to program participants 

5. Identify the source of the interface component 

ICD’s by nature are requirements documents: they define 
design requirements and allow integration. They can cause 
designs to be the way they are. They record the agreed-to design 
solution to interface requirements and provide a control mecha- 
nism to ensure that the agreed-to designs are not changed by one 
participant without negotiated agreement of the other participant. 

To be effective, ICD’s should track a schedule path compatible
with design maturation of a project (i.e., initial ICD’s
should be at the 80% level of detail at preliminary design
review, should mature as the design matures, and should reach 
the 99% mark near the critical design review). 


2.2 Identifying Interfaces 

Identifying where interfaces are going to occur is a part of
systems engineering that translates a mission need into a
configured system (a grouping of functional areas) to meet that
need. Each functional area grouping is assigned certain performance
requirements. These performance requirements are translated
into design requirements as the result of parametric
studies, tradeoff studies, and design analyses. The design
requirements are the basis for developing the system specifications.
The boundaries between the functional areas as defined
in the system specifications become the interfaces. Early interface
discussions often contribute to final subsystem specifications.
Interface characteristics, however, can extend beyond the
interface boundary, or interface plane, where the functional
areas actually come together. The interface could be affected
by, and therefore needs to be compatible with, areas that
contribute to its function but may not physically attach. For
example, it may be necessary to define the path of a piece of
equipment as it traverses through another piece of equipment
and rotates and articulates to carry out its function. Electrical
characteristics of a transmitter and receiver separated by an
interface plane may have to be defined for each to properly
function. Similarly, the acoustic energy produced by one component
and transmitted through the structure or onto another
component may need a corresponding definition.


3 For purposes of this manual, a piece of equipment is a functional area assigned
to a specific source. Thus, a piece of equipment can be an element of the space
station, a system of a spacecraft, a work package assigned to a contractor, or a
subsystem.

Identifying interfaces early in a program is essential to 
successful and timely development. Functional analyses are 
used for analyzing performance requirements and decompos- 
ing them into discrete tasks or activities (i.e., decomposing the 
primary system functions into subfunctions at ever increasing 
levels of detail). Functional block diagrams are used to define 
data flow throughout the system and interfaces within the 
system. Once the segments and elements within a system have 
been defined, a top-level functional block diagram is prepared. 
The block diagrams are then used in conjunction with N-squared
diagrams to develop interface data flows. The N-squared
diagram is a technique used extensively to develop data
interfaces but can also be refined for use in defining hardware 
interfaces. However, use of this tool in this manual will be 
restricted to interface categorization. Additional description is 
provided in section 3.1.1. 

In summary, identifying where interfaces are going to occur 
begins the systems integration component of systems engineering
and must start early in design planning. The interface
boundaries or planes vary from program to program depending 
on how design and development responsibilities are assigned. 
Interface control can occur within a functional area of other 
design and development agents. Therefore, interfaces can be 
identified at many levels, for example, 

1. Center to center 

2. Discipline to discipline (e.g., propulsion to guidance, 
sensor to structure, or power to users) 

3. Contractor to contractor 
4. Center to contractor to discipline

5. Program to program (e.g., shuttle to National Launch 
System) 

Once interface boundaries or planes are established, the 
interfaces must be categorized and defined. 

2.3 Categorizing (Partitioning) and 
Defining Interfaces 

Categorizing, or partitioning, interfaces separates the inter- 
face features by technical discipline and allows each category, 
in most cases, to proceed through the definition process 
independently. 

The following basic interface categories (defined by the 
types of feature and data they encompass) are recommended for 
use in most programs: 

1. Electrical/functional 

2. Mechanical/physical 

3. Software 

4. Supplied services 

During the early phases of systems engineering, interfaces 
may be assigned only the high-level designation of these 
categories. As the system becomes better defined, the details of 
the physical and functional interface characteristics become 
better defined and are documented. 

An interface can be assigned to one of these categories by a 
number of processes of elimination. The one recommended for 
use is the N-squared diagram (ref. 4), which is currently being
used by some NASA centers. 

2.3.1 Electrical/Functional

Electrical/functional interfaces are used to define and con- 
trol the interdependence of two or more pieces of equipment 
when the interdependence arises from the transmission of an 
electrical signal from one piece of equipment to another. All 
electrical and functional characteristics, parameters, and toler- 
ances of one equipment design that affect another design are 
controlled by the electrical/functional ICD. The functional 
mechanizations of the source and receiver of the interface 
electrical signal are defined, as well as the transmission
medium. 

The interface definition includes the data and/or control 
functions and the way in which these functions are represented
by electrical signals. Specific types of data to be defined are 
listed here: 

1. Function name and symbol

2. Impedance characteristics

3. Shielding and grounding

4. Signal characteristics

5. Cable characteristics

6. Data definition

7. Data transmission format, coding, timing, and updating

8. Transfer characteristics

9. Circuit logic characteristics

10. Electromagnetic interference requirements

11. Data transmission losses

12. Circuit protective devices

Other data types may be needed. For example, an analog 
signal interface document would contain function name and 
symbol, cable characteristics, transfer characteristics, circuit 
protective devices, shielding, and grounding; whereas a digital 
data interface would contain function name and symbol, data 
format, coding, timing and updating, and data definition. 
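
The analog and digital examples above can be read as a checklist of data types that a draft ICD must define. The following Python sketch is illustrative only: the names REQUIRED_DATA and missing_data_types are hypothetical, and the two lists simply restate the examples in the text rather than any official template.

```python
# Illustrative checklist only: the two entries restate the analog and digital
# examples in the text. Names here are hypothetical, not part of any NASA tool.
REQUIRED_DATA = {
    "analog": {
        "function name and symbol",
        "cable characteristics",
        "transfer characteristics",
        "circuit protective devices",
        "shielding",
        "grounding",
    },
    "digital": {
        "function name and symbol",
        "data format",
        "coding",
        "timing and updating",
        "data definition",
    },
}


def missing_data_types(kind, defined):
    """Return the data types still undefined for an interface of this kind."""
    return sorted(REQUIRED_DATA[kind] - set(defined))


# A hypothetical draft digital ICD that defines only three data types so far.
draft_icd = ["function name and symbol", "data format", "coding"]
print(missing_data_types("digital", draft_icd))
```

A real program would tailor each list to the interface at hand, since other data types may be needed.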

Additional data types under the electrical/functional heading
are

1. Transmission and receipt of an electrical/electromag- 
netic signal 

2. Use of an electrically conductive or electromagnetic 
medium 

Appendix A shows recommended formats for electrical and 
functional interface control drawings. 

2.3.2 Mechanical/Physical

Mechanical/physical interfaces are used to define and con- 
trol the mechanical features, characteristics, dimensions, and 
tolerances of one equipment design that affect the design of 
another subsystem. They also define force transmission re- 
quirements where a static or dynamic force exists. The features 
of the equipment that influence or control force transmission 
are also defined in this ICD. Mechanical interfaces include 
those material properties of the equipment that can affect the 
functioning of mating equipment, such as thermal and galvanic 
characteristics. Specific types of data defined are 

1. Optical characteristics 

2. Parallelism and straightness 

3. Orientation requirements 

4. Space or provisions required to obtain access for perform- 
ing maintenance and removing or replacing items, 
including space for the person performing the function 

5. Size, shape, mass, mass distribution, and center of gravity 

6. Service ports 

7. Indexing provisions 

8. Concentricity 

9. Surface finish 

10. Hard points for handling 
11. Sealing, pressurization, attachment, and locking
provisions 

12. Location and alignment requirements with respect to 
other equipment 

13. Thermal conductivity and expansion characteristics 

14. Mechanical characteristics (spring rate, elastic proper- 
ties, creep, set, etc.) 

15. Load-carrying capability 

16. Galvanic and corrosive properties of interfacing 
materials 

Other data types may be needed. For example, an ICD 
controlling a form-and-fit interface would generally contain 
such characteristics as size and shape of the item, location of 
attachment features, location of indexing provisions, and weight
and center of gravity of the item. However, an ICD controlling 
a structural load interface would contain weight and center of 
gravity, load-carrying capability, and elastic properties of the 
material if applicable to the loading conditions. Not all ICD’s
controlling a form-and-fit interface would have to contain all 
types of data given in this example, but some form-and-fit 
interface definitions contain more than the 16 types of data 
listed. Indexing definitions may require angularity, waviness, 
and contour definitions and tolerances. 

Additional data types under the mechanical/physical heading
would be

1. Dimensional relationships between mating equipment 

2. Force transmission across an interface 

3. Use of mechanically conductive media 

4. Placing, retaining, positioning, or physically transporting 
a component by another component 

5. Shock mitigation to protect another component 

Appendix B (from ref. 5) shows a mechanical/physical drawing.

This extensive variety of possibilities and combinations 
prevents assigning a standard set of data types or level of detail 
to a form-and-fit interface. Each interface must be analyzed and 
the necessary controlling data identified before the proper level 
of interface definition and control can be achieved. This holds 
true for all examples given in this chapter. 

2.3.3 Software 

A software interface defines the actions required when 
interfacing components that result from an interchange of 
information. A software interface may exist where there is no 
direct electrical interface or mechanical interface between two 
elements. For example, whereas an electrical ICD might define 
the characteristics of a digital data bus and the protocols used 
to transmit data, a software interface would define the actions 
taken to process the data and return the results of the process. 
Software interfaces include operational sequences that involve 


multiple components, such as data-processing interactions 
between components, timing, priority interrupts, and watchdog 
timers. Controversy generally arises in determining whether 
these relationships are best documented in an electrical/func- 
tional ICD, a software ICD, or a performance requirements 
document. Generally, software interface definitions include 

1. Interface communication protocol

2. Digital signal characteristics 

3. Data transmission format, coding, timing, and updating 
requirements 

4. Data and data element definition 

5. Message structure and flow 

6. Operational sequence of events 

7. Error detection and recovery procedures 

Other data types may be needed. Appendix C provides an 
example of a software interface signal. 
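
Several of the items listed above (message structure, data element definition, and error detection) can be made concrete with a small framing example. The sketch below is an assumption for illustration: the frame layout (2-byte message ID, 2-byte payload length, payload, 4-byte CRC-32) and the function names encode and decode are hypothetical and are not taken from appendix C or from any particular ICD. Recovery, such as requesting retransmission, is left to the caller.

```python
import struct
import zlib

# Hypothetical frame layout (an assumption for illustration):
#   2-byte message ID, 2-byte payload length, payload, 4-byte CRC-32.
HEADER = struct.Struct(">HH")
TRAILER = struct.Struct(">I")


def encode(message_id, payload):
    """Build a frame: header + payload + CRC-32 over header and payload."""
    body = HEADER.pack(message_id, len(payload)) + payload
    return body + TRAILER.pack(zlib.crc32(body))


def decode(frame):
    """Return (message_id, payload), or raise ValueError on a corrupt frame."""
    if len(frame) < HEADER.size + TRAILER.size:
        raise ValueError("frame too short")
    body, trailer = frame[:-TRAILER.size], frame[-TRAILER.size:]
    (expected_crc,) = TRAILER.unpack(trailer)
    if zlib.crc32(body) != expected_crc:
        raise ValueError("CRC mismatch")
    message_id, length = HEADER.unpack(body[:HEADER.size])
    payload = body[HEADER.size:]
    if len(payload) != length:
        raise ValueError("length field does not match payload")
    return message_id, payload


frame = encode(7, b"temp=21")
print(decode(frame))
```

Both sides of the interface would have to agree on every field shown here, which is the kind of detail a software ICD is meant to record.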

2.3.4 Supplied Services 

Supplied services are those support requirements that a piece 
of equipment needs to function. Supplied services are provided 
by an external separate source. This category of interface can be 
subdivided further into electrical power, communication, fluid, 
and environmental requirements. The types of data defined for 
these subcategories are 

1. Electrical power interface:

a. Phase 

b. Frequency 

c. Voltage 

d. Continuity 

e. Interrupt time 

f. Load current

g. Demand factors for significant variations during 
operations 

h. Power factor 

i. Regulation 

j. Ripple 

k. Harmonics

l. Spikes or transients 

m. Ground isolation 

n. Switching, standby, and casualty provisions 

2. Communication interface: 

a. Types of communication required between equip- 
ment 

b. Number of communication stations per communica- 
tion circuit 

c. Location of communication stations 

3. Fluid interface: 

a. Type of fluid required 

i. Gaseous 

ii. Liquid 
b. Fluid properties

i. Pressure 

ii. Temperature 

iii. Flow rate 

iv. Purity 

v. Duty cycle 

vi. Thermal control required (e.g., fluid heat lost or 
gained) 

4. Environmental characteristic interface: 

a. Ambient temperature 

b. Atmospheric pressure 

c. Humidity 

d. Gaseous composition required 

e. Allowable foreign particle contents 

Other data types may be needed. Appendix D shows an ex- 
ample of a supplied services interface for air-conditioning and 
cooling water. 


2.4 Documenting Interfaces 

Once an interface has been categorized and its initial con- 
tents defined, that interface definition must be recorded in a 
document that is technically approved by the parties (designer 
and manager) and the owners of both sides of the interface. The 
document then is approved by the next higher level in the 
project management structure and becomes the official control 
for interface design. 

The program manager must ensure that compliance with the 
approved interface control document is mandatory. Each level 
of program management must ensure that the appropriate 
contractors and Government agencies comply with the documentation.
Therefore, technical approval of the interface control
document indicates that the designated approving
organization is ready to invoke the interface control document 
contractually on the approving organization’s contractor or 
supporting organization. 

The interface categories can be grouped together in one 
document, or each category can be presented in a separate 
document (i.e., electrical ICD’s, mechanical ICD’s, etc.). The
format for interface control documents is flexible. In most cases 
a drawing format is the easiest to understand and is adaptable 
to the full range of interface data. 

The specification format (ref. 6) can also be used. The use of 
this type of format enables simple changes through the removal 
and insertion of pages; however, the format is often difficult to 
use when presenting complex interface definitions that require 
drawings, and normally requires many more pages to convey 
the same level of information. 

In either case there must be agreement on a standard for data 
presentation and interpretation. ANSI standard Y14.5 (ref. 7) 
can be used for dimensions, along with DOD-STD-100
(ref. 8), for general guidance of a drawing format. The specification
format should use MIL-STD-490 (ref. 6) for paragraph
numbering and general content. 

Some large programs require large, detailed ICD’s. Main- 
taining a large, overly detailed document among multiple 
parties may be more difficult than maintaining a number of 
smaller, more focused documents. Grouping small documents 
by major category of interface and common participants is one 
of the most effective and efficient strategies. It minimizes the 
number of parties involved and focuses the technical disci- 
plines, greatly streamlining the decision process and permitting 
much shorter preparation time. However, interfaces can be 
multidisciplinary and separate documents can result in mis- 
communications. 


2.5 Identifying Steady-State and 
Non-Steady-State Interfaces 

Interfaces can vary from a single set that remains constant for
the life of a program to a multiple set of documents that 
reconfigures during specific events in the life of a system. The 
first category would be used for an interplanetary probe. The 
interfaces of its instruments with the basic spacecraft structure 
would remain the same from assembly for launch throughout 
the life of the experiment. However, a continually evolving
platform, such as a lunar base, would perhaps be controlled in
a series of interface documents based on the assembly sequence 
of the base. An initial base would be established and later made 
more complex with additional structures and equipment deliv- 
ered by subsequent lunar flights. Pressurized elements, logistic 
elements, power-generating sources, habitats, laboratories, and 
mining and manufacturing facilities might be added and 
reconfigured over time. Each configuration would require a set
of interface control documents to ensure compatibility at the 
construction site as well as with the transportation medium 
from Earth to Moon. Interfaces that remained constant during 
this process might be termed steady state and require no further 
consideration once the interface was verified and delivered,
whereas interfaces that would evolve from the initial 
configuration through multiple iterations would require multi- 
coordination of interface parameters and schedules. The selec- 
tion of interface categories should identify the steady-state or 
non-steady-state nature of interfaces as well as their initial 
designations (ref. 9). 


2.6 Selecting a Custodian 

Selecting an ICD custodian can depend on several factors 
(e.g., percentage of interface ownership, relative mission im- 
portance of interface sides, and relative investment of interface 
sides). However, it is generally most effective if the custodian 
selected has an objective point of view. An example of this
would be someone who is independent of either side of the 
interface (i.e., without any “vested interest” in the interface 
hardware or software). Objectivity permits unbiased control of 
the interface, involvement of the custodian as an objective 
mediator, and documentation of the interface on a noninterfer- 
ence basis with program/project internal design. Selecting an 
independent interface custodian should be the first step in 
establishing an interface control organization. A set of criteria 
should be used to select the custodian by weighting the content 
and interests of the interface with the needs of interface control.
One set of criteria is as follows: 

1. Integration center: Is one center accountable for integrating
the interfaces controlled by this ICD? This criterion is
considered the most important because the integration center 
will have the final responsibility for certifying flight readiness 
of the interfaces controlled in the ICD. 

2. U.S. center: Is the participant a U.S. center? This crite- 
rion is considered the next most important because of agency 
experience and projected responsibility. 

3. Flight hardware or software: Is the interfacing article 
flight hardware or software (as opposed to support hardware or 
software)? Flight hardware or software takes precedence. 

4. Flight sequence: Does one side of the interfacing equip- 
ment fly on an earlier manifest than the other? An earlier flight 
sequence takes precedence over follow-on interfacing 
hardware. 

5. Host or user: Is the interfacing article a facility (as 
opposed to the user of the facility)? Procedure in this criterion 
is guided by the relative priority of the interfacing articles. 

6. Complexity: How complex is the interfacing equipment 
(relative to each side)? The more complex side of the interface 
normally takes precedence. 

7. Behavior: How active is the interfacing equipment? The
active side normally takes precedence over the passive side.

8. Partitions: How are the partitions (categories) used by the 
interfacing equipment? The relative importance of the parti- 
tions to the interface is acknowledged, and selection of the 
custodian is sensitive to the most important partition 
developers. 

Scores are assigned to each piece of interfacing equipment 
for each criterion. These scores can be determined by many 
different methods. Discrete values can be assigned to the first 
four criteria. A score of 1.0 is assigned if the interfacing piece 
of equipment is unique in meeting the criterion; the other piece
of equipment then receives a score of 0.0. Scores of 0.5 are 
assigned to both sides if both (or neither) of them meet the 
criterion. There is no definitive way of assigning scores to the 
last four criteria; however, verbal consensus or an unbiased 
survey can be used to assign scores. Also, the partition criteria 
can be scored by partition evaluation analysis (ref. 4). 
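
The discrete scoring rule for the first four criteria can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name discrete_scores and the sample answers are hypothetical, and the unweighted sum at the end is an assumption, since the text ranks the criteria by importance but gives no weights.

```python
def discrete_scores(side_a_meets, side_b_meets):
    """Score one of the first four criteria for both sides of an interface.

    A side that alone meets the criterion scores 1.0 and the other 0.0;
    if both or neither meet it, each side scores 0.5.
    """
    if side_a_meets == side_b_meets:
        return 0.5, 0.5
    return (1.0, 0.0) if side_a_meets else (0.0, 1.0)


# Hypothetical answers for the four discrete criteria (side A, side B).
answers = {
    "integration center": (True, False),
    "U.S. center": (True, True),
    "flight hardware or software": (False, False),
    "flight sequence": (False, True),
}

# Unweighted totals: an assumption, because the text supplies no weights.
total_a = total_b = 0.0
for criterion, (a, b) in answers.items():
    score_a, score_b = discrete_scores(a, b)
    total_a += score_a
    total_b += score_b

print(total_a, total_b)
```

Scores for the last four criteria would come from consensus, survey, or partition evaluation analysis and could be added to the same totals.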


2.7 Analyzing for Interface 
Compatibility 

The interface definitions to be documented on the ICD’s
must be analyzed for compatibility before the ICD is authenti- 
cated. Appendix E provides guidance on how compatibility 
analyses may be performed. They vary in their complexity from 
a simple inspection of the interface definitions to complex 
mathematical analyses where many variables are involved. 

Regardless of complexity, the compatibility analysis should 
be documented and maintained as backup information for the 
ICD. It can be used to expedite any changes to the interface 
definition by providing a ready means for evaluating the 
compatibility of the proposed change. The compatibility analy- 
sis also can be used to document how the interface definition 
was arrived at and why the definition is presented as it is on 
an ICD. 
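
At the simple end of the scale, a compatibility analysis by inspection amounts to checking that what one side delivers falls inside what the other side accepts. The Python sketch below illustrates that idea only; the Range class, the function incompatible_parameters, and the power interface values are hypothetical and are not drawn from appendix E or from any real ICD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A closed interval [low, high] for one interface parameter."""
    low: float
    high: float

    def contains(self, other):
        return self.low <= other.low and other.high <= self.high


def incompatible_parameters(delivered, accepted):
    """List parameters whose delivered range is not inside the accepted range.

    A parameter defined on only one side is also reported, since the
    interface definition is then incomplete.
    """
    problems = []
    for name in sorted(set(delivered) | set(accepted)):
        if name not in delivered or name not in accepted:
            problems.append(name)
        elif not accepted[name].contains(delivered[name]):
            problems.append(name)
    return problems


# Hypothetical values for a power interface; not taken from any real ICD.
delivered = {"voltage_V": Range(27.0, 29.0), "ripple_mV": Range(0.0, 150.0)}
accepted = {"voltage_V": Range(24.0, 32.0), "ripple_mV": Range(0.0, 100.0)}
print(incompatible_parameters(delivered, accepted))
```

Keeping such a check with the ICD backup information gives a ready way to rerun the analysis when a change to the interface definition is proposed.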

2.8 Verifying Design Compliance With 
Interface Control Requirement 

The ICD can only fulfill its purpose if the contractors’ 
detailed design drawings and construction practices adhere to 
the limits imposed by the ICD. Verifying compliance of the 
design as well as of the construction process is an integral part 
of interface control. 

Each contractor should be assigned the responsibility of 
denoting on their manufacturing and inspection drawings or 
documents those features and characteristics that, if altered, 
would affect interfaces controlled by the ICD’s. To ensure that 
all ICD requirements are covered, the contractor should select, 
at the highest assembly level at which the equipment is in- 
spected, the features and characteristics to be denoted. Any 
design change affecting an ICD-controlled feature or charac- 
teristic would be clearly identified even at the assembly level 
(ref. 10). 

Entries identified as “to be resolved” (TBR) can be bracketed 
or shaded to indicate preliminary interface information or an 
interface problem. This information is subject to further review 
and discussion and is an interim value for use in evaluating 
effects. Entries identified as “to be supplied” (TBS) represent 
data or requirements to be furnished. Appendix F shows a 
typical bracket system. 


2.9 Verifying Contract-Deliverable Item 

Each contract-deliverable item that is a mating side to an ICD 
interface should also be tested or measured to verify that the 
item complies with the requirement as specified in the ICD. The 
responsibility for administering and reporting on this verification
program could be assigned to the design agent, the contractor,
or an independent third party. If feasible, an independent
third party should be selected for objectivity. 

The verification methods should include analysis, measure- 
ment and inspection, demonstration, and functional testing. 
The specific methods employed at each interface will depend 
on the type of feature and the production sequence. Compliance 
should be verified at the highest practical assembly level. To 
preclude fabrication beyond the point where verification can be 
performed, an integrated inspection, measurement, and dem- 
onstration test outline of both hardware and software should be 
developed. This verification test outline will provide a sched- 
ule, tied to production, that allows all interface requirements to 
be verified. The resultant data and inspection sheets should 
become part of the verification data in the history jacket 
retained by the contractor for NASA. 


2.10 Training 2 

1. What is the purpose of interface control?

A. To define interfaces 

B. To ensure compatibility between interrelated equip- 
ment 

C. To provide an authority to control interface design 

D. All of the above 

2. How is an interface identified? 

A. By boundaries between functional areas 

B. By functional analyses of performance requirements 

C. By design features of a component that can affect the 
design features of another component 

3a. How can interfaces be categorized? 

A. Mechanical, electrical, software, and services 

B. Electrical/functional, mechanical/physical, software, 
and supplied services 

C. Electrical, physical, software, and supplies 

3b. What is one method of assigning an interface to one of the 
four basic categories? 

A. Functional flow block diagram 

B. Timeline analysis 

C. N-squared diagram

4a. How can an interface be documented? 

A. By drawing format 

B. By specification format 

C. By both of the above 


4b. Who approves the interface control document? 

A. Designer or manager 

B. Owners of both sides 

C. Both of the above 

4c. Who ensures compliance with the approved ICD? 

A. Designer or manager 

B. Owners of both sides 

C. Project manager 

5a. What is a steady-state interface? 

A. A single set that remains constant for the life of the 
project 

B. A multiple-set suite that reconfigures during specific 
events in the life of the system 

5b. Give an example of a steady-state interface. 

A. An interplanetary probe 

B. A lunar base 

5c. What features make this a good example of a steady-state 
interface? 

A. The basic structure of the spacecraft would remain the 
same from assembly for launch throughout the life of 
the experiment. 

B. An initial base would be established and subsequently 
made more complex with additional structures and 
equipment delivered by subsequent lunar flights. 

6a. How should an ICD custodian be selected? 

A. Percentage of ownership of the interface 

B. Relative investment of interface sides 

C. An objective point of view 

6b. What criteria should be used to select a custodian? 

A. Integration or U.S. center, flight hardware or software, 
flight sequence, host or user, complexity, behavior, 
and partitions 

B. Integration hardware, sequence user, and partitions 

6c. What scoring system can be used for these criteria? 

A. Zero to 1.0, verbal consensus, unbiased survey, and 
partition evaluation analysis 

B. One to 100, priority ranking, and voting 

7a. What is the purpose of an ICD compatibility analysis? 

A. Demonstrates definitions and provides mathematical 
analysis 

B. Demonstrates completeness of an interface definition 
and provides a record that the interface has been 
examined and found to be compatible 
7b. What are the four categories that require interface
analysis? 

A. Electrical/functional, mechanical/physical, supplied/ 
services, and hydraulic/pneumatic 

B. Electrical/functional, mechanical/physical, software, 
and supplied services 

7c. The hardware for mounting the satellite vehicle (SV)
adapter to the Titan IV Centaur is shown in figures 2.1
to 2.3.

A. Is there sufficient data to perform a compatibility
analysis?

i. Yes ii. No

B. Can the Jet Propulsion Laboratory specify the SV
adapter ring?

i. Yes ii. No

C. What items need to be bracketed?

i. Shear pin material and SV attachment view

ii. SV panel and view C-C

8a. What does a bracket on an ICD represent? 

A. Verification of design compliance 

B. An interface problem 


8b. What interface deficiency rating does a bracket discrep- 
ancy have? 

A. S & MA impact A > 1 or understanding of risk B > 2 

B. S & MA impact A < 1 or understanding of risk B < 2 

9a. How are mating sides of an ICD interface verified? 

A. Testing or measurement to meet requirements 

B. Analysis, measurement or inspection, demonstration, 
and functional testing 

9b. What does the verification test outline provide? 

A. Schedule, tied to production, that allows interface 
requirements to be verified 

B. Process controls, tied to manufacturing, for meeting 
schedules 

9c. Where is the resultant test and inspection data stored? 

A. Contractor files for use by an independent third party 

B. History jackets for use by NASA 
Figure 2.1.-Titan IV and satellite vehicle physical/envelope interfaces.







NASA RP-1370 




Figure 2.2.— Titan IV and satellite vehicle orientation. 
Figure 2.3.-Titan IV and satellite vehicle adapter ring.






















Chapter 3 

The Process: Through the Design Phases 


Interface control should be started when a program begins. 
This process eventually will define all interface design and
documentation responsibilities throughout the life cycle of the
program. Each program phase from concept development to
project construction is directly related to the maturity level of
interface control. 


3.1 Program Phases 


3.1.1 Concept Definition 


During the system engineering concept definition phase 
(from fig. 1.1), basic functional areas of responsibility are 
assigned for the various pieces of equipment that will be 
employed by the project (electrical power, environment con- 
trol, propulsion, etc.); see figure 3.1. At this point the design 
responsibilities of the responsible organization and related 
contractor (if chosen) should be defined to establish a set of 
tiered, traceable requirements. From these requirements the 
interfaces to be designed are identified by category (electrical/ 
functional, mechanical/physical, software, and supplied ser- 
vices) and by type of data that must be defined. This categori- 
zation will include a detailed review of each requirement to 
determine which requirements or features will be controlled by 
the interface control process. (What is important for this item to 
fulfill its intended function? On what interfacing equipment is 
this function dependent?) Once the interfaces to be controlled 
are selected, the formal procedures for interface control need to 
be established. These procedures include identifying the participants
responsible for the interface control documentation,
the approval or signoff loop for documentation, and the degree
to which all participants have to adhere to interface control
parameters, and establishing a missing design data matrix,
change procedures, etc. (See section 3.2.)

Concept definition

• Assign basic functional areas of responsibility.

• Define design responsibilities.

• Categorize interfaces.

• Define interfaces to be controlled.

• Establish formal interface control procedures.

• Disseminate scheme, framework, traceability.

Figure 3.1.-Establishment of interface control
process during concept definition.

Early development of the interface process, products, and
participants provides a firm foundation for the design engineer
to use the correct information in designing his or her portion of
an interface. It minimizes the amount of paper to be reviewed,
shortens the schedule, and concentrates the efforts of the
designer on his or her area of responsibility.

Initial selection of interfaces generally begins with listing
all pieces of equipment in a system and then identifying the
extent of interrelation among them. One tool used to help in this
process is the N-squared diagram. Details of this process can be
found in reference 4. The N-squared diagram was initially used
for software data interfacing; however, some centers are using
it for hardware interfaces. If the diagram is not polarized
initially (input/output characteristics not labeled), it is a conve- 
nient format for identifying equipment interfaces and for cat- 
egorizing them. An example of this form is shown in figure 3.2. 
This diagram can be further stratified to identify the interfaces 
for each of the categories; however, detailed stratification is 
best applied to electrical/functional, software, and supplied 
services interfaces. Using the N-squared diagram permits an
orderly identification and categorization of interfaces that can 
be easily shown graphically and managed by computer. 
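
Because the N-squared diagram can be managed by computer, a small sketch may help show its structure. The Python below is an illustration under stated assumptions: the equipment names, the interface assignments, and the functions n_squared and stratify are all hypothetical. It builds an unpolarized grid (equipment on the diagonal, shared interface categories off the diagonal) and then stratifies it by category.

```python
from itertools import combinations

# Hypothetical equipment list and interface assignments, for illustration.
equipment = ["power", "guidance", "propulsion", "structure"]

# Unpolarized entries: each pair is stored once, with no input/output
# direction, and maps to the interface categories the pair shares.
interfaces = {
    frozenset({"power", "guidance"}): {"electrical/functional", "supplied services"},
    frozenset({"guidance", "propulsion"}): {"electrical/functional", "software"},
    frozenset({"propulsion", "structure"}): {"mechanical/physical"},
}


def n_squared(equipment, interfaces):
    """Build the N x N grid: equipment on the diagonal, interfaces elsewhere."""
    grid = {}
    for item in equipment:
        grid[(item, item)] = item
    for a, b in combinations(equipment, 2):
        categories = sorted(interfaces.get(frozenset({a, b}), ()))
        # Unpolarized: the same entry appears on both sides of the diagonal.
        grid[(a, b)] = grid[(b, a)] = categories
    return grid


def stratify(interfaces, category):
    """Return the equipment pairs that share an interface in one category."""
    return sorted(tuple(sorted(pair)) for pair, cats in interfaces.items()
                  if category in cats)


grid = n_squared(equipment, interfaces)
print(grid[("guidance", "power")])
print(stratify(interfaces, "electrical/functional"))
```

An empty off-diagonal cell means no interface has been identified between that pair, which is itself worth reviewing during concept definition.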

By the end of this phase the basic responsibilities and 
management scheme, the framework for the interface control 
documentation, and the process for tracking missing interface 
design data (see section 3.2.2) should be established and 
disseminated. 

3.1.2 Requirements Definition 

During the requirements definition phase (fig. 3.3; from 
fig. 1.1), the definitions of the mission objectives are completed
so that each subsystem design can progress to development. 
Here, the technology to be used in the project will be defined to 
limit the risk associated with the use of new, potentially 
unproven technologies. Defining requirements and baselining 
interface documents early in the design process provides infor- 
mation to the designer needed to ensure that interface design 
is done correctly the first time. Such proactive attention to 
interfaces will decrease review time, reduce unnecessary 
paperwork, and shorten schedule times. By the end of require- 
ments definition all interface control documents should be 
prepared, interfaces defined to the most detailed extent pos- 
sible, and ICD’s presented for baselining. 
Figure 3.2.-N-squared diagram for orbital equipment. (Entries not polarized.)



Requirements definition

• Define technologies to be used. 

• Define and categorize all interfaces. 

• Prepare all interface control documents. 

• Identify all voids and assign both 
responsibilities and due dates. 

• Bound voids when possible. 

• Baseline interface documents. 

Figure 3.3. — Development and control of 
interfaces during requirements definition. 


Baselining is the act by which the program manager or 
designated authority signs an ICD. That signature establishes 
the ICD as an official document defining interface design 
requirements. The term “baselining” is used to convey that the 
ICD is the only official definition and that this officiality comes 
from the technical management level. Not only is the initial 
version of the ICD baselined, but each subsequent change or 
update to an ICD is also baselined. 

The baselined version of the ICD will identify (by a “void”) 
any missing design data that cannot be included at that time. 
Agreed-to due dates will be noted on the ICD for each data 
element required. Each void will define the data required and 
specify when and by whom such data will be supplied. Where 
possible, the data to be supplied should be bounded initially on 
the ICD. These bounds will be replaced by detailed data when 
the void is filled. The initial bounds give the data user (the other 
side of the interface) a range that can be used without risk, until 
the detailed data are supplied. Establishing these voids on 
ICD’s provides a means of ensuring that interface design data 
are supplied when they are required by the data user. Yet it 
allows design freedom to the data supplier until the data are 
needed. A recommended form for use in identifying the data 
needed is shown in figure 3.4. The criteria for choosing due 
dates are discussed in section 3.2. 
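
The void mechanism described above can be pictured as a small record that carries interim bounds until the detailed data arrive. The following Python sketch is illustrative only: the class name `Void`, its fields, the example identifiers and values, and the rule that a supplied value must fall inside the agreed bounds are all assumptions, not part of any NASA form or tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class Void:
    """One missing-data notation (void) on an ICD. All names are hypothetical."""
    void_id: str                                   # drawing/document number + void number
    data_required: str                             # brief description of the missing data
    supplier: str                                  # who will supply the data
    user: str                                      # who needs the data
    due: date                                      # agreed-to due date
    bounds: Optional[Tuple[float, float]] = None   # interim (low, high) range
    value: Optional[float] = None                  # detailed data, once supplied

    def fill(self, value: float) -> None:
        """Replace the interim bounds with the detailed value."""
        if self.bounds is not None:
            low, high = self.bounds
            # Assumption: the detailed value must honor the range the user designed to.
            if not (low <= value <= high):
                raise ValueError("detailed value falls outside the agreed bounds")
        self.value = value
        self.bounds = None

    def usable_range(self) -> Optional[Tuple[float, float]]:
        """Range the data user can design to without risk."""
        if self.value is not None:
            return (self.value, self.value)
        return self.bounds


# Hypothetical example: a mass figure that is bounded until the design firms up.
void = Void("ICD-001-V01", "connector mass (kg)", "supplier activity",
            "user activity", date(1996, 1, 15), bounds=(1.0, 1.5))
interim = void.usable_range()   # the bounds, while the void is open
void.fill(1.2)
final = void.usable_range()     # collapses to the supplied value
```

Keeping the bounds and the detailed value in one record mirrors the idea that the bounds are replaced, not supplemented, when the void is filled.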




NASA RP-1370 





Interface Design Data Required (IDDR) 

(Drawing/document number + Void number) 

Data required: Brief description of information needed to define interface element currently lacking details 

Data supplier: (Project center/code/contractor) 

Data user(s): (Project center/code/contractor) 

Date due: (Date design data are needed, either actual date or a period of time related to a specific milestone) 


Figure 3.4. — Format for interface design data required (IDDR). 



Interface Design Data Required (IDDR) 
Program Status Report 

Column headings: 

Drawing/doc # and IDDR # 
Sheet/page and Zone 
Short title and Data required 
Supplier(s) (Center/code/contractor) 
User(s) (Center/code/contractor) 
Due date (Yr/Mo/Day) 
Remarks 



Figure 3.5.— Format for monthly report on IDDR status. 


Documents should be baselined as early as possible, as soon 
as the drawings contain 10% of the needed information. The 
significance of early baselining is that both sides of the interface 
have the latest, most complete, official, single package of 
information pertaining to the design of the interface. 

The package includes all agreed-to design data plus a list of 
all data still needed, its current level of maturity, and who is 
to supply it, to whom, and when. 


Technical information voids in interface documents must be 
accounted for and tracked. Otherwise, there is no assurance that 
the needed information is being provided in time to keep the 
design on schedule. The status of these voids must be reported, 
and the owners of the interface-design-data-required forms 
(IDDR’s) must be held responsible for providing the needed 
information. It is recommended that the status be reported 
monthly to all parties having responsibility for the interfaces. 
A consolidated report is the most efficient, consumes the least 
paper and mail services, and allows the program manager to 
track areas important to the integration of the system compo- 
nents. The basic form shown in figure 3.5 is recommended for 
reporting and tracking IDDR’s. 

3.1.3 Systems Integration 

The interface control program continues to be active during 
the systems integration phase (fig. 3.6; from fig. 1.1). Design 
changes that establish a need for a new interface will follow the 
interface control change procedures as defined in section 3.2. 

Proposed design changes that affect existing interfaces are 
not given final approval until all participants’ and the cognizant 
center’s baselinings have been received through the ICD change 
notice system. 

During the various design reviews that occur in the full-scale 
engineering development phase, special attention should be 
given to design parameters that, if altered, would affect interfaces 
controlled by the ICD. It is strongly recommended that 
each design activity denote, on design and manufacturing 
documentation at the preliminary design review, through a 
bracket or some highlighting system, those features and char- 
acteristics that would affect an interface (see section 2.8). At the 
critical design review all voids should be resolved and all 
detailed design drawings should comply with interface control 
documentation (see section 2.9). 



• Manage and satisfy voids. 

• Invoke use of brackets on design drawings. 

• Ensure resolution of voids by the time of critical 
design review. 

• Verify compliance of design documentation with 
ICD’s. 

Figure 3.6.— Development and control of interfaces 
during systems integration. 


3.2 Preparing and Administering 
Interface Control Document 

3.2.1 Selecting Type of Interface Control Document 

A drawing, a specification, or some combination format 
should be selected for the ICD on a case-by-case basis. The 
drawing format generally is preferred when the ICD has fea- 
tures related to physical dimensions and shapes. The specifica- 
tion format is preferred when the ICD needs tables and text to 
describe system performance. Combinations are used when 
both dimensions and tables are needed. Members of the 
coordinating activity responsible for preparing the ICD deter- 
mine the format, which is approved by the appropriate project 
authority. Examples of drawing formats are given in appen- 
dixes A and B. 

The level of detail shown on the ICD varies according to the 
type and degree of design dependency at the interface being 
controlled. The ICD should clearly identify and control inter- 
faces between designs and enable compatibility to be demon- 
strated between the design areas. The key to a useful ICD is 
limiting the detail shown to what is required to provide compat- 
ibility. Any unnecessary detail becomes burdensome and may 
confuse the contractors responsible for designing the mating 
interface. Again, the ICD should, at a minimum, define and 
illustrate physical and functional interface characteristics in 
sufficient detail that compatibility, under worst-case tolerances, 
can be determined from the ICD alone; or it should 
reference applicable revisions of detailed design drawings or 
documents that define and bracket or identify features, characteristics, 
dimensions, etc., under worst-case tolerances, such 
that compatibility can be determined from the bracketed 
features alone. 

3.2.2 Tracking and Resolving Missing Interface 
Design Data 

Missing interface data should be identified on the ICD, and 
the ICD should control the date for its submission. The notation 
identifying the missing data should indicate the specific data 
required, how the data are being tracked for resolution, when 
the data are needed by the interfacing design agent, and by what 
date the required data will be supplied. Establishing data- 
required notations (or voids) on ICD’s helps ensure that inter- 
face design data will be supplied when needed; yet it allows 
design freedom to the data supplier until the due date. Every 
attempt should be made to establish realistic due dates and to 
meet that schedule unless there is a valid and urgent need to 
change a due date. 

These criteria and procedures should be followed in estab- 
lishing, reporting, and managing data due dates: 

1. Choose the due date as the date when the data user will 
start to be affected if agreed-upon or baselined data have not 
been received. 

2. When establishing a due date, allow time to process and 
authenticate a change notice to the ICD (i.e., once the date on 
which the data user needs the data has been established, set the 
data supplier’s due date earlier by the time needed to process 
the change notice). 

3. The custodian responsible for the ICD should periodi- 
cally, as determined by the appropriate project authority, 
prepare and distribute a report on the status of all missing design 
information for all project activities. The report should contain 
the following information: 

a. Identification of the data element needed, consisting of 
the ICD number, the date, and a two- or three-digit 
number that provides a unique identifier for the data 
element 

b. A short title for the ICD 

c. The activity that requires the data 

d. The date when the missing data are to be supplied or 
the period of time after the completion of a program 
event or milestone when the data are to be supplied 

e. The activity from which the data are due 

f. The status of the data required (i.e., late data, data in 
preparation, or change notice number) 

g. A description of the data required 
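
The due-date criteria and the report contents listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the names `MissingData`, `supplier_due_date`, `status_of`, and `report` are hypothetical, item a is collapsed into a single identifier string, and processing time is counted in calendar days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class MissingData:
    identifier: str           # item a (ICD number, date, and void number combined)
    short_title: str          # item b
    user: str                 # item c: activity that requires the data
    due: date                 # item d
    supplier: str             # item e: activity from which the data are due
    description: str          # item g
    change_notice: str = ""   # set once a change notice carries the data


def supplier_due_date(user_need: date, processing_days: int) -> date:
    """Back off from the user's need date by the change-notice processing time."""
    return user_need - timedelta(days=processing_days)


def status_of(item: MissingData, today: date) -> str:
    """Item f: late data, data in preparation, or change notice number."""
    if item.change_notice:
        return "change notice " + item.change_notice
    return "late data" if today > item.due else "data in preparation"


def report(items: List[MissingData], today: date) -> List[str]:
    """One text line per missing data element, earliest due date first."""
    rows = []
    for item in sorted(items, key=lambda i: i.due):
        rows.append(" | ".join([item.identifier, item.short_title, item.user,
                                item.due.isoformat(), item.supplier,
                                status_of(item, today), item.description]))
    return rows
```

Sorting by due date is a design choice made here so that the elements most likely to affect the data user appear first; the source does not prescribe an ordering.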

3.3 Initial Issuance of ICD 

The first issue of an ICD should be a comment issue. The 
comment issue is distributed to participating centers and con- 
tractors for review and comment as designated in the interface 
responsibility matrix (fig. 3.7). 

The interface custodian generates the responsibility matrix 
for ICD’s. The matrix specifies the center and contractor 
responsibilities for baselining, review and comment, and tech- 
nical approval. The matrix lists all ICD’s applicable to a 
particular program. Distribution of the ICD’s can then be 
controlled through this matrix as well. 

The review and comment process is iterative and leads to 
agreement on system interface definitions and eventual approval 
and baselining of the ICD. See figure 3.8 for a flow diagram of 
the issuance, review and comment, and baselining procedures 
for ICD’s. Concurrent distribution of the comment issue to all 
participants minimizes the time needed for review and subse- 
quent resolution of differences of opinion. 

3.4 Document Review and Comment 

As designated in the ICD responsibility matrix, all centers 
and contractors should submit technical comments through the 
appropriate authority to all other activities with review and 
comment responsibilities for the particular ICD and to the ICD 
custodian. 

Technical comments by all activities should be transmitted 
to the custodian as soon as possible but not later than 30 
working days (see footnote 4) from receipt of the comment issue. If the 
comment issue is technically unacceptable to the Government 
authority or the interfacing contractor, the rationale for 
unacceptability should be explained, including technical and 
cost effects if the interface definition is pursued as presented. 

3.4.1 Resolving Comments 

The ICD custodian collects review comments and works in 
conjunction with project management for comment resolution 
until approval is attained, the comment is withdrawn, or the 
ICD is cancelled. Information on comments and their disposi- 
tion and associated resolution should be documented and 
transmitted to all participants after all comments have been 
received and dispositioned. Allow two weeks (see footnote 4) for participants 
to respond to the proposed resolution. Nonresponses can be 
considered concurrence with the resolutions if proper 
prenotification is given to all participants and is made part of the 
review and comment policy. 
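
The concurrence rule in the paragraph above can be expressed as a small check. The sketch below is hypothetical: the function name `resolution_accepted` and the encoding of responses (True for concurrence, False for objection, None for no response by the deadline) are assumptions made for illustration.

```python
from typing import Dict, Optional


def resolution_accepted(responses: Dict[str, Optional[bool]],
                        prenotified: bool) -> bool:
    """Return True if the proposed resolution can be treated as agreed.

    responses maps each participant to True (concurs), False (objects),
    or None (did not respond within the allowed time).
    """
    for answer in responses.values():
        if answer is False:
            return False
        if answer is None and not prenotified:
            # Silence counts as concurrence only when the review and
            # comment policy said so in advance.
            return False
    return True
```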

When comments on the initial comment issue require major 
changes and resolution is not achieved through informal com- 
munications, an additional comment issue may be required 
and/or interface control working group (ICWG) meetings may 
need to be arranged. 

3.4.2 Interface Control Working Group 

The ICWG is the forum for discussing interface issues. 
ICWG meetings serve two primary purposes: to ensure effec- 
tive, detailed definition of interfaces by all cognizant parties, 
and to expedite baselining of initial ICD’s and subsequent 
drawing changes by encouraging resolution of interface issues 
in prebaselining meetings. A major goal of interface control 
should be that baselining immediately follow a prebaselining 
ICWG meeting. 

All ICWG meetings must be convened and chaired by the 
cognizant project organization. The project can choose a con- 
tractor to act as the chair of an ICWG when Government 
commitments are not required. In all cases the ICWG members 
must be empowered to commit the Government or contractor to 
specific interface actions and/or agreements. In cases where a 
contractor is ICWG chair, the contractor must report to the 
Government any interface problems or issues that surface 
during an ICWG meeting. 


4 The times assigned for commenting activities to respond are arbitrary and 
should be assigned on the basis of the schedule needs of the individual 
programs. 

Figure 3.8.— Flow of interface control document production. 


The ICWG chair prepares the ICWG meeting minutes or 
designates one of the meeting participants for this task. The 
minutes should include discussions of problems, agreements 
reached, decisions made, and action items. The ICWG chair 
also ensures that any updated interface control documentation 
reflecting the ICWG discussions is distributed within the 
timeframe agreed to by the affected participants. 

3.4.3 Approval/Signoff Cycle 

The management plan for the project assigns responsibility 
for each piece of equipment to a specific project authority and 
its contractor. The signoff loop for each ICD reflects this plan 
and can be related to the project and the origin of each design 
requirement. For each ICD, then, the signoff loop follows the 
sequence of technical approval by the contractors first and then 
by the appropriate project authority. 

3.4.4 Technical Approval 

The appropriate project authority and the primary and asso- 
ciate organizations with an interest in a particular ICD are listed 
in the responsibility matrix. They each sign the ICD to signify 
technical agreement and a readiness to contractually invoke its 
requirements. 


3.4.5 Baselining 

Interface control documents are baselined when the owners 
of both sides of the interface at the next level up in the program 
structure come to technical agreement and sign the document. 


3.5 Change Notices 

The procedure for initiation, review, technical approval, 
baselining, and distribution of changes to project ICD’s 
(fig. 3.9) should conform to the following guidelines. 

3.5.1 Initiating Changes 

Any project activity should request a change to an ICD when 

1. Data are available to fill a void. 

2. Information contained in a data-required note needs to be 
modified. 

3. Additional data are needed (i.e., a new data requirement 
has been established). 

4. A technical error is discovered on the ICD. 

5. An equipment design change or a system or equipment 
rearrangement that would require changes to an interface or 
creation of new interfaces is proposed to improve performance, 
reduce cost, or expedite scheduled deliveries. 

Figure 3.9.— Development and flow of change notices in the ICD revision process. 

3.5.2 Requesting Changes 

All requests for changes should be submitted to the organi- 
zation responsible for maintaining the ICD, with copies to all 
activities that will review the resultant change notices and to the 
appropriate project authority. If baselining is needed in less 
than 30 days, a critical change should be requested. All requests 
for changes should be submitted in a standard format that 
includes the following items: 

1. Originator’s identification number — used as a reference 
in communications regarding the request; it should 
appear on resulting change notices 

2. Originating activity — originating project and code or 
originating contractor 

3. Point of contact — name, area code, telephone number, 
facsimile number, and e-mail address of the person at the 
originating activity to be contacted regarding the request 

4. Document affected — number, revision letter, and short 
title of each ICD that would be affected by the change 

5. Number of data voids (if applicable) — number of data 
requirements for which data are being provided 

6. Urgency — indication of whether this change is critical or 
routine (project decides whether to use critical route) 

7. Detailed description of change — a graphic or textual 
description of the change in sufficient detail to permit a clear 
portrayal and evaluation of the request. Separate descriptions 
should be provided when more than one ICD is affected. 

8. Justification — concise, comprehensive description of the 
need and benefit from the change 

9. Impact — concise, comprehensive description of the effect 
in terms of required redesign, testing, approximate cost, 
and schedule effects if the requested change is not approved; 
also the latest date on which approval can occur and not affect 
cost or schedule 

10. Authorizing signature of the organization requiring the 
change 

Upon receipt of a change request to an ICD, the ICD 
custodian coordinates the issuance of a proposed change notice. 
First, the ICD custodian evaluates the technical effect of the 
proposed change on the operation of the system and mating 
subsystem. If the effect of the change is justified, the ICD 
custodian generates and issues a change notice. If the justifica- 
tion does not reflect the significance of the change, the ICD 
custodian rejects the request, giving the reason or asking for 
further justification from the originating organization. The ICD 
custodian evaluates an acceptable change request to determine 
whether it provides data adequate to generate a change notice. 
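
As a rough illustration of the standard request format and the custodian's adequacy check, the ten items can be held in one record and screened for blanks. Everything named here (`ChangeRequest`, `missing_items`, `suggested_urgency`) is hypothetical, and the 30-day threshold is applied in calendar days as an assumption; the text leaves the final choice of the critical route to the project.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ChangeRequest:
    originator_id: str            # item 1
    originating_activity: str     # item 2
    point_of_contact: str         # item 3
    documents_affected: List[str] # item 4
    data_voids: int               # item 5 (0 if not applicable)
    urgency: str                  # item 6: "critical" or "routine"
    description: str              # item 7
    justification: str            # item 8
    impact: str                   # item 9
    authorizing_signature: str    # item 10


def missing_items(req: ChangeRequest) -> List[str]:
    """Names of items left blank (data_voids may legitimately be 0)."""
    missing = []
    for f in fields(req):
        if f.name != "data_voids" and not getattr(req, f.name):
            missing.append(f.name)
    return missing


def suggested_urgency(days_until_baselining_needed: int) -> str:
    """Suggest the critical route when baselining is needed in under 30 days."""
    return "critical" if days_until_baselining_needed < 30 else "routine"
```

A request that returns a nonempty list from `missing_items` would, in this sketch, correspond to the case where the custodian asks the originator for further information before generating a change notice.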


The proposed change notice describes the specific changes 
(technical or otherwise) to the ICD in detail by “from-to” 
delineations and the reasons for the changes, as well as who 
requested the changes and how the change request was trans- 
mitted (i.e., by letter, facsimile, ICWG action item, etc.). 

3.5.3 Proposed Change Notice Review and 
Comment Cycle 

The review and comment cycle for proposed changes to 
ICD’s should follow the same system as that used for the initial 
issuance of the ICD (see sections 3.3 and 3.4). 

3.5.4 Processing Approved Changes 

The baselined change notice should be distributed to all 
cognizant contractors and project parties expeditiously to prom- 
ulgate the revised interface definition. The master ICD is 
revised in accordance with the change notice, and copies of the 
revised sheets of the ICD are distributed (see sections 3.3 and 
3.4). Approval of the change by the project constitutes author- 
ity for the cognizant organization to implement the related 
changes on the detailed design. 

3.5.5 Distributing Approved Changes 

The custodian distributes the baselined change notice to all 
cognizant centers and contractors to expeditiously promulgate 
the revised interface definition. The master ICD is then revised 
in accordance with the change notice, and copies of the revised 
ICD sheets are distributed as was the change notice. 
The responsibility matrix (fig. 3.7) can be used to identify the 
distribution of change notices as it was used for the distribution 
of the ICD’s. 

3.5.6 Configuration Control Board 

During development the project’s configuration control 
board is responsible for reviewing and issuing changes to the 
configuration baseline. The board reviews all class I engineer- 
ing change proposals to determine if a change is needed and to 
evaluate the total effect of the change. The configuration 
control board typically consists of a chairman and representatives 
from the project management office, customers, engineering, 
safety assurance, configuration management (secretary), 
fabrication, and others as required. 

Changes to configuration items can only be effected by the 
duly constituted configuration control board. The board first 
defines a baseline comprising the specifications that govern 
development of the configuration item design. Proposed changes 
to this design are classified as either class I or class II changes. 
Class I changes affect form, fit, or function. However, other 
factors, such as cost or schedule, can cause a class I change. 
Class I changes must be approved by the project before being 
implemented by the contractor. 

All other changes are class II changes. Examples of class II 
changes are editorial changes in documentation or hardware 
changes (such as material substitution) that do not qualify as 
class I changes. Project concurrence, generally, is required for 
the contractor to implement class II changes. Government plant 
representatives (Defense Contract Administration Services 
(DCAS), Naval Plant Representative Office (NAVPRO), and Air 
Force Plant Representative Office (AFPRO)) usually accomplish 
these tasks. 
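
The class I and class II distinction can be summarized as a simple decision. The sketch below is a simplification under an explicit assumption: the text says cost or schedule factors "can" cause a class I change, whereas this hypothetical function `classify_change` treats any flagged cost or schedule effect as class I.

```python
def classify_change(affects_form_fit_function: bool,
                    affects_cost_or_schedule: bool = False) -> str:
    """Classify a proposed change to a configuration item design.

    Assumption: a flagged cost or schedule effect is always treated as
    class I here, which is stricter than the source wording.
    """
    if affects_form_fit_function or affects_cost_or_schedule:
        return "class I"    # needs project approval before implementation
    return "class II"       # e.g., editorial change or material substitution
```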

3.5.7 Closing the Loop 

A wide range of methods are available for verifying by test 
that the design meets the technical requirements. During the 
definition phase analysis may be the only way of assessing what 
is largely a paper design. Typical methods are testing by 
similarity, analysis, modeling, and use of flight-proven components; 
forecasting; and comparison, mathematical modeling, 
simulation modeling, and using flight-proven experience and 
decisions. The actual methods to be used are determined by the 
project office. Each method has associated costs, requires 
development time, and provides a specific level of performance 
verification. The Government and industry managers must 
carefully trade off program needs for performance verification 
with the related costs. 

If any demonstrated or forecast parameter falls outside the 
planned tolerance band, corrective action plans are prepared by 
the contractor and reviewed by the Government project office. 
Each deviation is analyzed to determine its cause and to assess 
the effect on higher level parameters, interface requirements, 
and system cost effectiveness. Alternative recovery plans are 
developed showing fully explored cost, schedule, and technical 
performance implications. Where performance exceeds re- 
quirements, opportunities for reallocation of requirements and 
resources are assessed. 
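
The tolerance-band screening described above amounts to comparing each demonstrated or forecast parameter with its planned band. The following sketch is illustrative only; the function name `out_of_tolerance`, the parameter names, and the dictionary layout are assumptions, and the reallocation step for performance that exceeds requirements is not modeled.

```python
from typing import Dict, List, Tuple


def out_of_tolerance(measured: Dict[str, float],
                     bands: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of parameters that fall outside their planned band.

    Each flagged parameter would call for a corrective action plan.
    """
    flagged = []
    for name, value in measured.items():
        low, high = bands[name]
        if not (low <= value <= high):
            flagged.append(name)
    return sorted(flagged)
```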

Although functional and performance requirements are con- 
tained in the appropriate configuration item specification, the 
definition, control, and verification of interface compatibility 
must be handled separately. Otherwise, the volume of detail 
will overwhelm both the designers and managers responsible 
for meeting the functional and performance requirements of the 
system. Early establishment of the interface definition and 
control process will provide extensive savings in schedule, 
manpower, money, and paper. This process will convey pre- 
cise, timely information to the interface designers as to what the 
designer of the opposing side is committed to provide or needs 
and will subsequently identify the requirements for verifying 
compliance. 

Whether the interface is defined in a drawing format or in a 
narrative format is at the discretion of the program. What is of 
primary importance is that only the information necessary to 
define and control the interface should be on these contractual 
documents to focus the technical users and minimize the need 
for updating information. 


Appendix G provides seven ICD guidelines that have been 
used by many successful flight projects and programs to pro- 
vide such a focus on the interface definition and control 
process. 


3.6 Training 2 

1a. When should the ICD process be started? 

A. Concept definition B. Requirements definition 
C. Systems integration 

1b. What are the benefits of early development of the ICD 
process? 

A. Assigns basic areas of responsibility 

B. Provides firm foundation for design, minimizes 
paper, shortens schedule, and concentrates efforts 

1c. What tool can be used to list equipment and identify their 
interrelations in a system? 

A. Prechart B. N-squared diagram 

2a. What should be done in the ICD process during require- 
ments definition? 

A. Define mission objectives 

B. Define technology and interfaces and present for 
baselining 

2b. What is baselining? 

A. The designated authority signing an ICD 

B. The only official definition 

2c. How are voids in an ICD accounted for and tracked? 

A. Procedure or administration report 

B. Monthly program status report on interface design 
data required 

3a. What should be done in the ICD process during develop- 
ment? 

A. Manage voids, invoke brackets, resolve voids, and 
verify compliance 

B. Control interface developments 

3b. How should proposed design changes be handled? 

A. Discussed at critical design review 

B. Discussed and approved by all participants 

3c. What should be given special attention? 

A. Design parameters that affect controlled ICD 

B. Manufacturing documentation 


2 Answers are given at the end of this manual. 

4a. When is the drawing format used for an ICD? 

A. To describe the type and nature of the component 

B. To describe physical dimensions and shapes 

4b. When should a specification be used? 

A. To describe performance with tables and text 

B. To describe a software function 

4c. What is the key to providing a useful ICD? 

A. Give as much detail as possible 

B. Limit the detail to what is necessary to demonstrate 
compatibility 

5a. What is the purpose of the initial issue of an ICD? 

A. Issuance, review, comment, and baselining 

B. Review and resolution of differences of opinion 

5b. Who is responsible for controlling the flow of an ICD? 

A. Contractor 

B. Custodian 

6a. Who should review ICD’s? 

A. Organizations designated in the responsibility 
matrix 

B. ICD custodian 

6b. How are comments resolved? 

A. By the project office 

B. By project management and custodian working for 
resolution and approval or the comment being with- 
drawn 

6c. Where are interface issues discussed? 

A. Project office 

B. Interface control working group 


6d. Who approves and baselines an ICD? 

A. Projects at the next level up in program structure 

B. The project office 

7a. When should a project activity request a change to an ICD? 

A. At the custodian’s request 

B. When data are available, requirements need change, 
an error is discovered, or the design changes 

7b. What items should be included in a change notice request? 

A. Identification number, activity, contact, document 
affected, number of data voids, urgency, descrip- 
tion, justification, impact, and authorizing signature 

B. Those established by the ICWG 

7c. Who evaluates and issues a proposed change notice? 

A. ICD custodian 

B. Project office 

7d. What does a proposed change notice describe? 

A. Specific changes (from-to), reasons, and the 
requestor 

B. Project notices 

7e. How is a change notice approved and distributed? 

A. By the project authority to all cognizant parties 

B. By all cognizant parties to the contractors 


National Aeronautics and Space Administration 
Lewis Research Center 
Cleveland, Ohio, 44135, July 1995. 

Appendix A 

Electrical/Functional Interface Example 


This appendix illustrates elements of a telemetry drawing 
interface control document showing control of waveform 
parameters and data rates. This interface example depicts data 
transfer between a guidance system electronics assembly and a 
launch vehicle telemetry system. The basic drawing (fig. A.1) 
covers the isolation elements of the guidance system, the jack 
and pins assigned, and shielding and grounding on the guidance 
side of the interface. Bus functions are named (e.g., guidance 
telemetry data 1 (parametric)), and the shielding requirements 
through to the first isolating elements of the telemetry system 
are provided (see notes on fig. A.1). 

Table A.1 contains the details to be controlled for each bus 
function. Signal source (electronics assembly) and destination 
(telemetry system) are identified. The waveform (fig. A.2) and 
its critical characteristics (table A.2) are provided, as well as 
data rates and sources and load impedances. Telemetry load 
impedance is further described by an equivalent circuit (see 
note 3 on fig. A.1). 

The final value of pulse minimum amplitude is missing in 
this example. This is noted by the design-data-required (DDR) 
callout in table A.2 and the accompanying DDR block (fig. 
A.3). The DDR block notes that the responsible parties have 
agreed on an amplitude band with which they can work until the 
guidance design becomes firm. However, there is also a date 
called out that indicates when (45 days after preliminary design 
review) the telemetry contractor must have the data to be able 
to complete design and development and deliver the telemetry 
in time to support launch vehicle flight. 
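
To make the DDR example concrete, the interim amplitude band and the due date can be worked out as follows. This is a sketch under assumptions: the 9 ± 2 V band and the 45-day period come from table A.2 and the DDR block, but the function names are hypothetical, the 45 days are counted as calendar days, and any review date passed in is invented for illustration.

```python
from datetime import date, timedelta
from typing import Tuple

NOMINAL_V = 9.0      # interim minimum-amplitude value from table A.2
TOLERANCE_V = 2.0    # agreed band around the nominal value


def amplitude_band(nominal: float = NOMINAL_V,
                   tolerance: float = TOLERANCE_V) -> Tuple[float, float]:
    """Worst-case limits the telemetry designer can work to for now."""
    return (nominal - tolerance, nominal + tolerance)


def ddr_due_date(preliminary_design_review: date,
                 days_after: int = 45) -> date:
    """Date by which the final value must reach the data user.

    Assumption: the period is counted in calendar days.
    """
    return preliminary_design_review + timedelta(days=days_after)
```

Expressing the due date relative to a milestone, as the DDR block does, means the date moves automatically if the review itself is rescheduled.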

The parameters called out in this example are only those 
needed to control the design of either side of the interface 
through the first isolating element. Also note that only the 
shielding and wire gage of the launch vehicle cabling between 
the two systems are provided. Only pin numbers for the 
guidance side of the interface are called out and controlled. 
Connector types and other pertinent cable specifications are as 
per a referenced standard that applies to all launch vehicle 
cabling. In this case the same pulse characteristics apply to each 
of the functions covered; however, table A.2 is structured to 
permit variation for each function if the design should dictate 
different values for the characteristics of each function. 

Figure A.1.—Guidance/launch vehicle telemetry interface. 













Table A.1.—GUIDANCE/LAUNCH VEHICLE TELEMETRY 






[Waveform diagram; labeled feature: pulse duration] 

Notes: 

1. The interpulse period shall be the period from 150 ns after the trailing edge of 
a pulse until 100 ns prior to the leading edge of the subsequent pulse. 

2. The reference level shall be the average voltage for the last 200 ns of the 
interpulse period. 

3. The no-transmission level shall be 0 V differential at the guidance/launch vehicle 
interface using the test load specified in table A.2. 

4. Shielding depicted represents the telemetry shielding requirements only. For 
cable routing see void #01. Telemetry shielding shall be carried through all 
connectors between the electronic assembly and the telemetry subsystem. 

5. A radiofrequency cap shall be provided on electronic assemblies in all launch 
vehicles in lieu of this connector. 

Figure A.2. — Guidance data pulse characteristics. 

Table A.2.— REQUIRED PULSE CHARACTERISTICS AND TEST PARAMETERS 

[Values apply to each guidance telemetry function: data 1, data 2, bit 
synchronization, frame synchronization, data 1 word synchronization, and 
data 2 word synchronization.] 

Pulse characteristics (see fig. A.2)    Value 

Pulse duration                          255 ± 50 ns 
Minimum amplitude                       9 ± 2 V (see V027) 
Maximum amplitude                       15 V 
Rise time                               75 ns maximum 
Fall time                               125 ns maximum 
Undershoot                              2.5 V maximum 
Reference level offset                  0 to -4.5 V relative to no-transmission level 
Noise                                   1.4 V maximum peak to peak 
Receiver susceptibility                 2.0 V minimum 

Test parameters: 
  Test load                             75 Ω ± 5% resistive 
  Receiver susceptibility               2.0 V minimum 





DDR No. 3288399-V027 

Data required: 

Guidance subsystem waveform parameter data (minimum amplitude 
coordinated temporary amplitude band currently on 

Data supplier: 

SP-2012/guidance telemetry steering committee 

Data user(s): 

SP-2732/launch vehicle telemetry contractor/interface coordinator 

Date due: 

45 days following guidance preliminary design review 


Figure A.3. Typical design data required for table A.2. 

Appendix B 

Mechanical/Physical Interface Examples 


B.1 Mechanical Interface for 
Distributed Electrical Box 

Figure B.1 is an example of an interface development document 
(IDD) that, from initial inspection, appears to be fairly 
complete. This figure contains a great amount of detail and just 
about everything appears to be dimensioned. However, closer 
examination will reveal serious shortcomings. 

First, the basic function of the interface must be defined . 1 he 
box depicted must be capable of being removed and replaced on 
orbit in many cases outside the crew habitat. In some cases it 
is to be removed and replaced robotically. The box slides along 
the L-shaped bracket held to the support structure by three 
mounting bolts labeled “bolt 1 ,” “ bolt 2,” and “bolt 3.” As the 
box slides along the L-shaped bracket from left to nght in the 
fioure, some piloting feature on the box connectors engages the 
connectors mounted to the support structure by the spring- 
mounted assembly, and the connector engages fully when the 
lead screw is completely engaged. 

1 The initial interface area to be examined is that of the 
L-shaped bracket to the support structure (i.e., the interface of 
the three mounting bolts). The interface is being examined from 
the perspective of the designer of the support structure. Does 
figure B.l contain enough information for a mating interface to 
be designed? (The area of interest has been enlarged and is 

presented as figure B.2.) 

a. The dimensions circled in figure B.2 and lettered a, b, 
c, and d locate the position of the mounting bolts 
relative to the box data. The following pertinent 
differences are noted concerning this dimensioning: 

i. Dimension a locates the holes relative to a “refer- 
ence datum for coldplate support structure,” but 
the datum is not defined on the drawing. Is it a line 
or a plane? What are the features that identify/locate 
the datum? What is the relationship of this datum to 
other data identified on the IDD (data A, B, and D)? 
This information is required so that the designer 
of the support structure can relate his or her 
interface features easily to those of the box IDD. 

ii. The IDD states that the tolerance on three-place 
decimals is ±0.010. Dimensions a, b, c, and d 
are three-place decimal dimensions and would, 
therefore, fall under this requirement. Elsewhere on 
the IDD a true position tolerance for bolt locations 
is indicated. A feature cannot be controlled by both 
bilateral and true positioning tolerancing. It must be 
one or the other. Considering the function of the 
mounting bolts (to locate the box relative to the 
electrical connectors), it has to be assumed 
that dimensions a, b, c, and d are basic dimensions. 
Interface control drawings cannot require the 
designer of the mating interface to assume any- 
thing. IDD’s must stand by themselves. 

b. Figure B.3 depicts initial details of mounting bolts 
for the L-shaped bracket. On first inspection there 
appears to be a great amount of detail. However, further 
examination shows that much of the detail is not related 
to interface definition. The interface is the bolt. Where 
is it relative to other features of the box? What is the 
relationship of bolts 1 and 2 to bolt 3 (datum C)? 
What is the thread of the bolt? How long is the bolt? 
The following data on the IDD are not required: 

i. Counterbore for bolt head 

ii. Diameter of bolt hole in bracket for bolts 1, 2, 

and 3 

iii. Distance of bolt hole to first thread 

iv. The fact that there is a screw retaining ring 
Adding data not required for the interface, even if they 
are only pictorial, is expensive. It takes time for the 
organization to develop and present it, and it takes 
time for the designer of the mating interface to deter- 
mine that the information is not necessary and discard 
it. If the extraneous information stays on the IDD, it 
must be maintained (i.e., changed if the design details 
change). Only the features of a design that affect the 
features of the design of the mating interfaces need 
be placed on the IDD. 

c. Once the unnecessary data are removed, what remains 
is shown in figure B.4. The data that remain are not 
complete and are unclear. The true position notations 
are indicated as being those for the “mounting inter- 
face for bolt,” suggesting that the true position applies 
to the hole in the support structure. However, since the 
IDD is basically covering the features of the box, it is 
assumed that these locations apply to the bolts on the 
box. It should not be necessary to have to make 
assumptions about data on an IDD or ICD. The 
document should stand by itself. 

The only other data left in figure B.4 are the callouts for 
the locking inserts. These callouts refer to the method 
used by the designer of the support structure for retaining 
the bolts. This IDD should not have this callout, since the 
method used for retaining the bolts is not the responsibil- 
ity of the box designer. Generally IDD’s and ICD’s 
should not specify design solutions, especially when 
the design solutions are not the responsibility of the 
one specifying them. 

Figure B.1.— Partial interface development document. 
Figure B.4.— Necessary details of mounting bolts. 
Figure B.5.— Minimal interface definition. 

What is missing is how far the bolts protrude from the 
box. These data are required so that the designer of the 
support structure knows how deep to make the mating 
hole and how much of a mating thread must be supplied 
to grip the bolts on the box. 

Considering all of the above, figure B.5 represents 
what is really required (along with the locations and 
thread types already defined in fig. B.1) to define the box 
side of the interface and for the designers of the support 
structure to design a compatible interface between the 
retaining bolts and the support structure. 


2. The next area to be examined is that of the connector 
interface. Since both parts of the connector are being provided 
by the box designer, the interface is the plate on which the 
connectors are attached to the support structure. Again, the 
question is, Does figure B.1 contain enough information for a 
mating interface to be designed? The answer to that question is, 
Definitely not! The interface of the plate (holding the connec- 
tors) that mates with the support structure is identified as datum 
D. Again, there is no definition of this datum. Is it a plane 
passing through the three highest points of the plate or some 
other features of the connector plate? 

If a compatible mating interface is to be designed, the 
relationship between the surface to which the connector plate is 
attached and the surface to which the L-shaped bracket is 
attached must be known. None of these data are supplied in 
figure B.1. The following are data needed to establish this 
relationship: 

a. The required perpendicularity of D to A 

b. The required parallelism of D to B 

c. The required angular relationship of the vertical centerline 
shown in view B-B with the vertical centerline shown in 
view A-A 

d. The pattern required for the four fasteners holding the 
connector plate to the support structure. View B-B does 
contain a dimension of 2.594 for a horizontal spacing of 
the lower two features but does not indicate that this 
dimension is applicable to the upper two fasteners. In 
addition, there is no dimension for the distance between 
the fasteners in the Z direction. 

e. The required relationship of the hole pattern for the 
connector plate relative to the box, namely, 

i. The location of the hole pattern above A in the Z 
direction 

ii. The location of the hole pattern relative to C in the 
X direction 

iii. The distance of datum D from C in the Y direction 
when the box is fully installed 

Since none of these data are identified as items to be determined 
(TBD’s), it must be assumed either that the data are not required 
because the connectors can be mated properly with a great deal 
of misalignment or that the box designer did not recognize that 
this type of data is required. Designers never wish to freeze a 
design. The placement of design constraints in an ICD is 
basically freezing an area of a design or at least impeding the 
ability to change a design without that design being scrutinized 
at another level. Therefore, the tendency of designers is to 
disclose the minimum that they feel is necessary in the 
interface for the control process. This is the primary reason 
for the ICD custodian not to be organizationally a part of 
the design process. Yet the ICD custodian must have access to 
the design function of an agency or contractor organization to 
ensure the ready flow of the data required for proper interface 
definition. (Can interface compatibility be demonstrated from 
the ICD’s alone?) 

The ICD custodian must always test the data in interface 
documentation from the viewpoint of another design agent who 
must develop a compatible mating interface. 

The preceding discussion simplifies specification of the 
L-shaped bracket and the mounting bolts. This redefinition of 
the interface tied up loose ends and provided needed dimen- 
sions and callouts absent from the original document. These 
portions of the document can now be controlled more easily and 
related to a 100% mate design. 


B.2 Space Reservation and Attachment 
Features for Space Probe Onboard 
Titan IV Launch Vehicle 

Figure B.6 is an example of an ICD that defines the space 
envelope available onboard the Titan IV launch vehicle for a 
payload and the attachment feature details for the launch 
vehicle side of the interface. The intended payload is the 
Cassini Mission spacecraft. The Titan payload fairing, as 
would be expected, is defined. The other side of this envelope 
(i.e., the spacecraft) must also be defined to show compatibility. 
When the spacecraft dimensions are established, compatibility 
should be shown by a comparison of the two envelopes. The 
Titan documentation defines the available space reserved for 
equipment (i.e., a stay-out zone for the Titan launch vehicle 
items). Ideally, this ICD should define a minimum space 
available for the spacecraft. Therefore, if the spacecraft dimen- 
sions are constrained to a maximum size equal to the launch 
vehicle’s minimum, less a value for environmental effects, etc., 
then the two envelopes are compatible. 

Since interface data have been provided for the attachment 
details for the launch vehicle side of the interface, the design of 
the Cassini adapter for mounting to the Centaur launch vehicle 
at station -150.199 can be explained by using the Titan design 
data. 

The following key interface features have been established 
for this connection: 

1. Sheet 1 (fig. B.6(a)), note 5: Location of holes is estab- 
lished by a common master gauge tool with reference dimen- 
sions provided. 

2. Sheet 3 (fig. B.6(c)), section F-F: Bearing areas are to be 
flat within 0.006 (units), and per view G the maximum bearing 
area has been defined. 

3. Sheet 3 (fig. B.6(c)), view H: Shape and dimensions of the 
shear alignment pins have been established. 

4. Sheet 1 (fig. B.6(a)), note 4: How loads are to be transmit- 
ted is indicated. 

The following data elements missing from figure B.6 are 
mostly related to the lack of spacecraft design data: 

1. No apparent tracking of TBD’s. A tracking system 
should be in place at the beginning of ICD development. 
Each TBD should have a unique sequential identifier with 
due dates and suppliers established. 

2. No revision block for tracking the incorporation of changes. 
Some type of revision record should be placed on each sheet. 

Upon exchange of design data relating to the Cassini probe, it 
would be expected that the probe’s maximum envelope would 
be established and related to the data system of the Titan/ 
Centaur launch vehicle. 

This example is basically a one-sided interface. The Titan/ 
Centaur side of the interface is well defined, which is to be 
expected considering the maturity of the design. The tendency 
should be resisted, in cases like this, to ignore or place less 
emphasis on the definition and documentation of the mating 
interface, given the completeness of the launch vehicle side of 
the interface. The mating interface, namely, the spacecraft side, 
should be completely defined. Otherwise, the spacecraft de- 
signer will be signing up to design a compatible interface by 
agreeing with what the interface on the launch vehicle side 
looks like. Although this approach allows freedom to go off and 
“do independent things,” it lacks the degree of positive control 


needed for interface compatibility. The chances for an incom- 
patibility are much less if the spacecraft side of the interface is 
defined. Space vehicle data, stations, and fasteners must be 
identified and controlled. The designer of the space vehicle is 
then able to commit to the design and production of an interface 
that is defined. The launch vehicle designers can then verify 
that the spacecraft interface will mate with the launch vehicle. 

Appendix C 

Software Interface Example: 

Definitions and Timing Requirements for Safety 
Inhibit Arm Signals 


Signal definition        Centaur sequence   Initiating event + time    Persistence   Function 
                         control unit 
                         switch number 

Satellite vehicle (SV)   45                 Main engine cutoff         3±0.5 sec     Unshorts SV pyro capacitor banks 
pyro unshort                                (MECO) 2 + 3±0.5 sec 
(primary) 

SV latch valve           33                 MECO2 + 10±0.5 sec         3±0.5 sec     Arms safety inhibit relay for SV 
arm (primary)                                                                        main engines 

SV pyro unshort          89                 MECO2 + 15±0.5 sec         3±0.5 sec     Provides redundant unshort of SV 
(secondary)                                                                          pyro capacitor banks 

SV latch valve           88                 MECO2 + 17±0.5 sec         3±0.5 sec     Provides redundant arm of inhibit 
arm (secondary)                                                                      relay for SV main engines 

Radiofrequency           34                 Titan IV/Centaur           3±0.5 sec     Services backup (redundant to SV 
monopropellant driver                       separation + 24±0.5 sec                  ground support equipment com- 
backup enable                                                                        mand) enable of safety inhibit SV 
                                                                                     functions (radiofrequency sources 
                                                                                     and reaction control system thruster 
                                                                                     drivers) 
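
The timing columns of the table above can be turned into worst-case assertion and release windows. The following is a minimal sketch, not part of the source document: the class name `ArmSignal`, the function `may_overlap`, and the reading of "persistence" as the time a signal stays asserted are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmSignal:
    name: str
    switch: int            # Centaur sequence control unit switch number
    event: str             # initiating event the delay is measured from
    delay_s: float         # nominal delay after the initiating event
    persistence_s: float   # nominal time the signal stays asserted (assumed meaning)
    tol_s: float = 0.5     # tolerance on both delay and persistence

    def start_window(self):
        """Earliest and latest assertion time after the event, in seconds."""
        return (self.delay_s - self.tol_s, self.delay_s + self.tol_s)

    def end_window(self):
        """Earliest and latest release time after the event, in seconds."""
        lo, hi = self.start_window()
        return (lo + self.persistence_s - self.tol_s,
                hi + self.persistence_s + self.tol_s)


def may_overlap(a, b):
    """True if, at worst-case tolerances, both signals could be asserted at once."""
    if a.event != b.event:
        raise ValueError("signals are timed from different events")
    first, second = sorted((a, b), key=lambda s: s.delay_s)
    return first.end_window()[1] > second.start_window()[0]


# Rows transcribed from the table above.
SIGNALS = [
    ArmSignal("SV pyro unshort (primary)", 45, "MECO2", 3.0, 3.0),
    ArmSignal("SV latch valve arm (primary)", 33, "MECO2", 10.0, 3.0),
    ArmSignal("SV pyro unshort (secondary)", 89, "MECO2", 15.0, 3.0),
    ArmSignal("SV latch valve arm (secondary)", 88, "MECO2", 17.0, 3.0),
    ArmSignal("RF monopropellant driver backup enable", 34,
              "Titan IV/Centaur separation", 24.0, 3.0),
]
```

Under these assumptions, the secondary pyro unshort could be released as late as 19.0 sec after MECO2 while the secondary latch valve arm could be asserted as early as 16.5 sec, so `may_overlap` returns True for that pair. The table does not say whether such overlap is intended; the sketch only makes the windows explicit.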
Appendix D 

Supplied Services Interface Example 


This appendix provides a simplistic text-based example of a 
supplied services (air-conditioning and cooling water) inter- 
face control document with a typical design-data-required 
(DDR) block. This example contains elements condensed from 
a number of service documents originally used for a submarine 
weapons program; however, the principles contained herein are 
universally applicable to any complex system of interfaces. 
Page 1 of the ICD lists the DDR’s (table D.1) showing DDR 
numbers, location on the drawing, brief description, and due 
date. The DDR block (fig. D.1) on the drawing expands on this 
information and identifies supplier, user, and time urgency of 
the data needed. The DDR numbering convention used here is 
“V09 = Void #09.” Preceding the void number with the ICD 
number provides a program-unique DDR number that is easily 
related to its associated ICD and easily maintained in a data 
base. 
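
The numbering convention just described (ICD number, then the void number) can be sketched as a pair of helpers. This is an illustration only: the function names and the `width` parameter are assumptions, chosen so that both the two-digit form used in this appendix and the three-digit form seen in appendix A ("V027") can be produced.

```python
import re


def ddr_number(icd_number, void_number, width=2):
    """Build a program-unique DDR number such as '1466134-V09'."""
    return f"{icd_number}-V{void_number:0{width}d}"


def parse_ddr_number(ddr):
    """Split a DDR number back into (ICD number, void number)."""
    match = re.fullmatch(r"(\d+)-V(\d+)", ddr)
    if match is None:
        raise ValueError(f"not a DDR number: {ddr!r}")
    return match.group(1), int(match.group(2))
```

For example, `ddr_number("1466134", 9)` gives `"1466134-V09"`, and parsing that string returns the ICD number and void number, which is what makes the identifier easy to relate to its ICD in a data base.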


TABLE D.1.— DESIGN-DATA-REQUIRED SUMMARY AND LOCATOR 

Void      Location     Description       Date due 
number 

V01 

V09       Sheet 1,     Main heating      30 days after 
          zone C-7     and cooling       authentication of 
                       (MHC) water       data fulfilling 
                       schedule          DDR 5760242-V12 






DDR No. 1466134-V09 

Data required: 
Heating and cooling (HC) system upper zone 
water schedule (supply water temperature versus 
environmental temperature) 

Data supplier: 
HC working group 

Data user: 
Launch vehicle design agent 

Date due: 
30 days after authentication of data fulfilling DDR No. 
2543150-V12 


Figure D.1. — Typical design-data-required block. 

The following pages present the kinds of data required to fully 
define the air-conditioning requirements for suites of equip- 
ment located in a launch control center. Table D.2 details 
conditioned-air distribution; table D.3 presents typical inter- 
face data required to ensure that a cooling water service is 
provided to electrical equipment and indicates requirements for 
the equipment before and after the incorporation of an engi- 
neering change. 

701. Launch vehicle control center services: 

A. Air-conditioning shall be provided with a dedicated 
closed-circuit system capable of supplying a mini- 
mum total flow of 12 820 scfm with a 50% backup 
capability. 

1. The conditioned air shall be distributed to each 
equipment flue as specified in table D.2. The distrib- 
uted conditioned air at the inlet to the equipment 
shall satisfy the following parameters: 

a. Temperature: The minimum temperature shall be 
65 °F and the maximum, 70 °F. 

b. Humidity: The maximum humidity shall be 75 
grains per pound of dry air. 

c. Working pressure: The working pressure shall 
be enough to overcome equipment pressure drops 
and to maintain positive pressure at the equip- 
ment outlet with respect to compartment ambi- 
ent pressure. A 10% minimum leakage rate in the 
compartment shall be assumed. 

d. Flow resistance: The system shall be able to over- 
come the pressure drop across the equipment (i.e., 
from exit of orifice plate to top of equipment) as 
shown in table D.2. 

e. Flow profile: 

(1) The flow distribution for each flue shall be 
such that the flow velocity between the flue 
centerline and 1.3 in. from the edge of the flue, 
and (where equipment permits) 6 in. above the 
flue gasket, shall not be less than 80% of the 
achieved average flow velocity. The achieved 
average flow velocity must equal or exceed veloc- 
ity based on the minimum flow rate specified in 
table D.2. 

(2) Velocity profiling is not required for flues 
designated 301 through 310, 011 through 015, 
446BC, 405-2A, 405-2B, 405-6A, and 405-6B. 

f. Adjustment capability: The system shall provide 
flow adjustment from 0 to 300 scfm at each of the 
equipment flues requiring velocity profiling. 


g. Air quality: Air at the inlet to the equipment shall 
be equivalent to or better than air filtered through 
a 0.3-µm filter with an efficiency of 95%. 

2. The closed-loop system shall have the capacity of 
removing 52.8 kW (minimum) of heat dissipated by 
equipment using closed-circuit conditioned air. This 
heat load includes 1.3 kW reserved for launcher 
equipment in the launch vehicle control center (see 
note 702 below). 
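
Several of the note 701 parameters are simple numerical limits, so they lend themselves to an automated check. The sketch below is illustrative only: the function names are hypothetical, and taking the "achieved average flow velocity" as the arithmetic mean of the sampled point velocities is an assumption, since the note does not say how the average is obtained.

```python
def inlet_air_violations(temperature_f, humidity_gr_per_lb):
    """Check inlet air against the temperature and humidity limits of note 701.A.1."""
    problems = []
    if not 65.0 <= temperature_f <= 70.0:
        problems.append("temperature outside 65 to 70 °F")
    if humidity_gr_per_lb > 75.0:
        problems.append("humidity above 75 grains per pound of dry air")
    return problems


def velocity_profile_ok(point_velocities, min_average_velocity):
    """Check the flow profile rule of note 701.A.1.e(1).

    point_velocities: velocities sampled across the profiled region of a flue.
    min_average_velocity: velocity corresponding to the table D.2 minimum flow.
    """
    # Assumption: the achieved average is the mean of the sampled points.
    average = sum(point_velocities) / len(point_velocities)
    if average < min_average_velocity:
        return False
    # No sampled point may fall below 80% of the achieved average.
    return all(v >= 0.8 * average for v in point_velocities)
```

A flue with samples of 100, 95, and 105 passes against a required average of 90, whereas samples of 100, 70, and 130 fail because one point falls below 80% of the average.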

702. The system shall provide the capability of removing 
1.65 kW minimum of heat dissipated by equipment by using 
compartment ambient air as a cooling medium while maintain- 
ing the compartment within specified limits. 

A. The ship shall take no action that eliminates the option 
for launcher equipment to use compartment ambient air 
or closed-circuit conditioned air for dissipating launcher- 
generated heat of 1.3 kW. 

B. Heat dissipated to ambient air by equipment using 
closed-circuit conditioned air is not included. 

703. The system shall provide distribution trunks to equip- 
ment flues with total flow capacity as designated below for the 
conditions of table D.2: 


Trunk     Minimum flow, 
          scfm 

A         2700 
B         1620 
C         2300 
D         3400 
E         1300 
F         1500 
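
As a cross-check, the six trunk minimums add up to 12 820 scfm, the minimum total flow given in note 701.A. The document does not state that relationship explicitly, so the sketch below treats it as an observation; the names `TRUNK_MIN_FLOW_SCFM` and `trunk_shortfalls` are hypothetical.

```python
# Minimum trunk flows from note 703, in scfm.
TRUNK_MIN_FLOW_SCFM = {
    "A": 2700, "B": 1620, "C": 2300, "D": 3400, "E": 1300, "F": 1500,
}

# Minimum total system flow from note 701.A, in scfm.
SYSTEM_MIN_TOTAL_FLOW_SCFM = 12820


def trunk_shortfalls(measured_scfm):
    """Return {trunk: shortfall in scfm} for trunks measured below their minimum."""
    return {
        trunk: required - measured_scfm.get(trunk, 0)
        for trunk, required in TRUNK_MIN_FLOW_SCFM.items()
        if measured_scfm.get(trunk, 0) < required
    }
```

Given hypothetical measurements in which only trunk B falls short (say 1600 scfm), `trunk_shortfalls` returns `{"B": 20}`.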


704. Flows at reference designations marked with an asterisk 
in table D.2 are to be considered flow reserve capabilities. 
These designated flues do not require verification of flow per 
table D.2 nor profiling per note 701.A.l.e(l) until these flues 
are activated. The Government-furnished pipe assemblies and 
caps will be supplied for flues not activated. 

705. The minimum flow for flues 446BC and 447BC is 
100 scfm before change 30175 and 250 scfm after change 
30175. 

TABLE D.2. — CONDITIONED-AIR DISTRIBUTION 

Equipment                     Trunk       Flue        Minimum        Flow resistance/ 
                              (see note               flow,          pressure drop at 
                              703)                    scfm           minimum flow (see 
                                                                     note 701.A.1.d), 
                                                                     in. H2O 

Data cabinets                 A           301B        225            0.54 
                                          301C        260 
                                          305B        80             0.50 
                                          305C        80             0.50 
                                          306B        290            0.56 
                                          306C        50             0.50 

Data console                  A           308B        100            0.50 
                                          308C        50             0.50 
                                          309         0*             — 
                                          310B        135            0.50 
                                          310C        50             0.50 

Control console               E           405-2A      100            1.0 
                                          405-2B      100 
                                          405-6A      50 
                                          405-6B      50 

Power buffer and              B           011         440            2.0 
conversion                                012         440 
                                          013-1       150 
                                          013-2       150 
                                          015         440 

Control computer              D           440BC       200            1.0 
group                                     440-441D    300 
                                          444BC       300 
                                          444-445D    250 
                                          446BC       See note 705 
                                          447BC       See note 705 
                              E           471         200 

Control group                 E           450BC       200 
                                          450-451D    200 
                                          451BC       100 
                              C           452BC       200 
                                          452-453D    200 
                                          458BC       200 
                                          458-459D    200 
                                          459BC 
                              E           472 

Power distribution            F           002BC 
                                          003BC 
                                          004BC       150* 
                                          004D        150* 

Load                          F           271BC       275            1.0 
                                          271D        0*             0 
                                          005BC       100*           1.0 
                                          005D        0*             0 

*Flow reserve capability. 

TABLE D.3.— WATER FLOW RATE INTERFACE PARAMETERS 

Function                        Minimum cooling   Water flow rate 
                                capability 

Electrostatically supported     2.25-kW gain      (a) 6.0-gal/min nominal total flow for two 
gyro navigator (ESGN) and                         ESGN binnacles and one GSS binnacle. 
gravity sensor system (GSS)                       The supply shall maintain constant flow 
binnacle cooling                                  of 2.0 gal/min ±10% to each binnacle. 
                                                  (b) A remote, low-flow alarm shall be pro- 
                                                  vided for the ESGN binnacles and the 
                                                  GSS binnacle. 

Reserve capability for future   3.25-kW gain      2.6-gal/min minimum 
navigation development 

ESGN binnacle cooling           1.5-kW gain       (a) 4.0-gal/min nominal total flow for two 
                                                  ESGN binnacles. The supply shall main- 
                                                  tain a constant flow of 2.0 gal/min ±10% 
                                                  to each binnacle. 
                                                  (b) A remote, low-flow alarm shall be pro- 
                                                  vided for the ESGN binnacles. 

Reserve capability for future   4.0-kW gain       4.5-gal/min minimum 
navigation development 

(a) The system shall provide test connections at the inlet and outlet of each . . . 

Remarks 

Reliability of water supply shall support a navigation 
subsystem availability of 0.97. This service requirement 
shall be continuously available during patrol and refit. 
The water temperature shall not vary by more than 
1 °F when changing at the rate of 0.25 °F/sec maximum. 
This change shall not occur more than once per 30-min 
period. 
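
The flow figures in table D.3 are internally consistent: three binnacles at 2.0 gal/min each give the 6.0-gal/min total, and two give 4.0 gal/min. A minimal sketch of that arithmetic and of the ±10% band follows; the function names are hypothetical and not taken from the source.

```python
NOMINAL_FLOW_PER_BINNACLE = 2.0   # gal/min, from table D.3
FLOW_TOLERANCE = 0.10             # ±10%, from table D.3


def nominal_total_flow(binnacle_count):
    """Nominal total flow, in gal/min, for the given number of binnacles."""
    return binnacle_count * NOMINAL_FLOW_PER_BINNACLE


def per_binnacle_limits():
    """Lower and upper flow limits, in gal/min, for a single binnacle."""
    return (NOMINAL_FLOW_PER_BINNACLE * (1 - FLOW_TOLERANCE),
            NOMINAL_FLOW_PER_BINNACLE * (1 + FLOW_TOLERANCE))


def flow_within_tolerance(measured_gal_per_min):
    """True if a measured per-binnacle flow lies inside the ±10% band."""
    low, high = per_binnacle_limits()
    return low <= measured_gal_per_min <= high
```

With these values the per-binnacle band is roughly 1.8 to 2.2 gal/min. The threshold for the remote low-flow alarm is not given in the table, so it is deliberately left out of the sketch.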
Appendix E 

Compatibility Analysis 


E.1 Definition 

Compatibility analysis of the interface definitions contained in 
an ICD is a major tool of interface control. It serves a twofold 
purpose: 

1. Demonstrates completeness of interface definition. If any 
interface data are missing or presented in a manner that cannot be 
integrated by using the ICD alone as a data source, the ICD is 
considered deficient. 

2. Provides a record (traceability) that the interface has been 
examined and found to have the right form and fit. This record 
can then be used in evaluating the acceptability of subsequent 
change proposals. 


E.2 Kinds of Data 

The following compilation identifies the kinds of data that 
must be obtained for a compatibility analysis and outlines the 
general steps that should be followed for four categories of 
interface: electrical/functional, mechanical/physical, software, 
and supplied services: 

I. Interface category — electrical/functional 
A. Data required to perform analyses 

1. The following parameters are required, considering 
the specific function or signal involved. 

a. Cabling and connectors 

b. Power requirements 

c. Electromagnetic interference, electromagnetic 
compatibility, electromagnetic radiation, and 
grounding requirements 

d. Functional flow and timing requirements 

e. Signal definition 

f. Digital data definition to the bit level 

g. Protocol levels 

h. Seven-layer International Standards Organization 
open systems interconnection stack definition or its 
equivalent 

i. Error recovery procedures 

j. Startup and shutdown sequences 

k. Adequacy of standards used or referenced 

2. Unique requirements for an interface or a piece of 
equipment different from overall system require- 
ments (i.e., the hierarchy of specifications required) 

3. Adequate definition of all signals crossing the inter- 
face. “Adequate” is difficult to define precisely but 
depends on the signal type (e.g., analog or digital) 
and the intended use. In general, the interface must 
show the characteristics of the isolating device (ele- 
ment) on each side of the interface and define the 
signal characteristics in engineering terms suitable 
for the particular type of signal. 

4. Timing and other functional interdependencies 

5. System handling of error conditions 

6. Full definition of any standards used. Most digital 
transmission standards have various options that 
must be selected; few, if any, standards define the 
data that are passed. 

B. Steps to be followed 

1 . Verify interoperability of connectors. 

2. Size cables to loads. 

3. Determine cable compatibility with signal and envi- 
ronmental conditions. 

4. Define data in one document only. 

5. Determine adequacy of circuit protection devices 
and completeness of signal definition. 

II. Interface category — mechanical/physical 
A. Type of interface — form and fit 

1 . Data required to perform analysis 

a. A datum (reference) that is common to both sides 
of the interface (e.g., a mounting hole in one part 
that will mate with a hole or fastener in the other 
mating parts or a common mating surface of the 
two mating parts) 

b. Dimensions and tolerances for all features of each 
part provided in a manner that gives the optimum 
interface fit and still provides the required design 
functions. Optimum interface means dimension- 
ing so that the tolerance accumulation is kept to a 
minimum. 

2. Steps to be followed 

a. Start with the common datum and add and subtract 
dimensions (adding the tolerance accumulations 
for each dimension) for each feature of the part 
interface. 

b. Determine the dimensional location of the 
interface-unique features by adding and subtract- 
ing the tolerance accumulations from resulting 
dimensions to achieve the worst-case maximum 
and minimum feature definitions. 

c. Perform the same analysis for the mating features 
of the interfacing part. 

d. Compare and question the compatibility of the 
worst-case features of the two mating parts. (Will 
the maximum condition of one part fit within the 
minimum condition of the mating part?) 

B. Type of interface — structural load 

1 . Data required to perform analysis 

a. A description of the loading conditions (static or 
dynamic) and the duration of those conditions 

b. Characteristics of the equipment involved: weight 
or mass; mass distribution; elastic properties; and 
sensitivity of elastic properties to temperature, 
moisture, atmospheric gas content, pressure, etc. 

2. Steps to be followed. This analysis involves placing 
the interfacing items in a position that produces the 
maximum loads while the items are interfacing. A 
space experiment is primarily designed for flight 
loads, yet it must withstand the loads developed 
during the launch and deployment cycles and per- 
haps unique loads during launch processing. The 
complexity of the compatibility analysis will vary 
depending on the types of interfacing items and 
environments. 

a. Attachment loads are the simplest, being a state- 
ment of the loads applied by the attaching feature 
(bolt) and the load capability of the component 
being retained (flange). 

b. Hoisting and handling loads require the calcula- 
tion of bending moments or shear for various 
loading scenarios. Dynamic and environmental 
loads must also be considered. (How quickly is the 
load applied? What are the wind loading factors?) 

c. A more complex situation will be the loads devel- 
oped during a dynamic interaction of interfacing 
items where different material characteristics must 
be considered along with the reaction characteris- 
tics of the materials (e.g., a flexible beam of 
varying moments of inertia supported by an elas- 
tomeric medium where the entire system is 
subjected to a high-velocity impulse of a few 
microseconds duration). Such a condition could 
produce loads that exceed those for which one of 
the interfacing items is designed. Another inter- 
facing item may have to be redesigned so as not to 
jeopardize the mission of the primary item (i.e., 
increasing the strength of the item being supported 
could increase the weight). 

III. Interface category — software 

A. Type of interface — software. The ICD is required to 
specify the functional interface between the computer 
program and any equipment hardware with which it 
must operate. Often, the supplier documentation for 
standard computer peripherals and terminals is ad- 
equate for this purpose. Conversely, it has been found 
that performance specifications governing the design 
of new equipment are not satisfactory for use in a 


functional ICD. The purpose of an ICD is to communi- 
cate equipment interface requirements to programmers 
in terms that the programmers readily and accurately 
understand and to require equipment designers to con- 
sider the effect of their designs on computer programs. 
B. Type of interface — hardware/software integration. The 
ICD provides an exact definition of every interface, by 
medium and by function, including input/output 
control codes, data format, polarity, range, units, bit 
weighting, frequency, minimum and maximum timing 
constraints, legal/illegal values, accuracy, resolution, 
and significance. Existing documentation may be ref- 
erenced to further explain the effect of input/output 
operations on external equipment. Testing required to 
validate the interface designs is also specified. 
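
To show what a definition "to the bit level" with bit weighting, units, range, and legal/illegal values might look like in practice, the sketch below decodes one field from a data word. Everything in it is hypothetical: the `FieldSpec` layout, the example field, and its limits are invented for illustration and do not come from any ICD in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    start_bit: int      # position of the least significant bit in the word
    width: int          # field width in bits
    lsb_weight: float   # engineering value of one count (bit weighting)
    units: str
    signed: bool        # True for two's-complement fields
    legal_min: float    # smallest legal engineering value
    legal_max: float    # largest legal engineering value


def decode(word, spec):
    """Extract a field from a data word and convert it to engineering units."""
    raw = (word >> spec.start_bit) & ((1 << spec.width) - 1)
    if spec.signed and raw & (1 << (spec.width - 1)):
        raw -= 1 << spec.width          # two's-complement sign extension
    value = raw * spec.lsb_weight
    if not spec.legal_min <= value <= spec.legal_max:
        raise ValueError(f"{spec.name}: illegal value {value} {spec.units}")
    return value


# Hypothetical 8-bit signed temperature field in bits 4 to 11, 0.5 °C per count.
TEMPERATURE = FieldSpec("temperature", 4, 8, 0.5, "°C", True, -40.0, 85.0)
```

With this specification, the word 0x0640 decodes to 50.0 °C and 0x0F40 to -6.0 °C, while 0x0800 (a raw count of -128, or -64.0 °C) is rejected as illegal. Stating each of these attributes in the ICD is what lets programmers and equipment designers agree on the same decoding.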

IV. Interface category— supplied services 
A. Type of interface— fluid service 
1 . Data required to perform analysis 

a. Type of fluid required by the equipment and 
type of fluid the service supplier will provide. 
This may be in the form of a Federal or military 
specification or standard for both sides or for 
one side of the interface. 

b. Location of the equipment/service interface 
(hose connection, pipe fitting, etc.) 

c. Equipment requirements at the interface loca- 
tion in regard to characteristics (pressure, tem- 
perature, flow rate, duty cycle, etc.) 

d. Capability of the service supplier at the interface 
location 

e. Manner in which the equipment can affect the 
capability of the service supplier (e.g., having a 
large backpressure that the supplier fluid must 
push against or a combination of series and 
parallel paths that the supplier fluid must pass 
through) 

2. Steps to be performed. Examine the supplier and 
equipment requirements to determine 

a. If the supplier capability meets or exceeds the 
equipment requirements. This may require con- 
verting a Federal/military specification or stan- 
dard requirement into what is specified for the 
equipment. 

b. If the supplier capability meets the require- 
ments, considering the effects resulting from the 
fluid passing through the mating equipment 
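
The two fluid-service checks above can be sketched as a comparison of supplier capability against equipment requirements at the interface location. This is a simplified illustration: the class and function names are hypothetical, the units are arbitrary examples, and modeling the equipment's effect as a single backpressure subtracted from the supply pressure is an assumption rather than a method described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FluidRequirement:
    min_pressure_psi: float
    min_flow_gal_per_min: float
    max_temperature_f: float


@dataclass(frozen=True)
class FluidSupply:
    pressure_psi: float
    flow_gal_per_min: float
    temperature_f: float


def shortfalls(requirement, supply, equipment_backpressure_psi=0.0):
    """List the characteristics for which the supplier does not meet the requirement."""
    # Assumed model: the mating equipment reduces the usable pressure
    # by a fixed backpressure.
    available_pressure = supply.pressure_psi - equipment_backpressure_psi
    found = []
    if available_pressure < requirement.min_pressure_psi:
        found.append("pressure")
    if supply.flow_gal_per_min < requirement.min_flow_gal_per_min:
        found.append("flow rate")
    if supply.temperature_f > requirement.max_temperature_f:
        found.append("temperature")
    return found
```

A supplier that looks adequate on paper can fall short once the effect of the mating equipment is included, which is the point of step 2.b: the same supply passes with no backpressure and fails on pressure when a large backpressure is applied.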

B. Type of interface — environmental 
1 . Data required to perform analysis 

a. Conditions required for equipment to function 
properly. Storage, standby, and operating 
scenarios need to be established and defined. 

b. Supplier’s capability to provide the environment 
specified in terms of time to reach steady state from 
transients resulting from uncontrollable external 
environments; the limits of the steady-state conditions 
(maximum/minimum); and monitoring features 

2. Steps to be performed. Perform analyses (e.g., 
thermal) under extreme and nominal environmen- 
tal conditions to verify that supplier’s equipment 
can maintain the environment required for the 
equipment. The complexity of the analysis may 
vary depending on the types of items involved. 


a. Simple inspection, which considers the environ- 
ment required by an item versus the capability of 
the ambient in which the item resides 

b. Complex analysis, which must consider uncon- 
trolled external environmental inputs, the ther- 
mal properties of intermediate systems that do 
not contribute to the end environment but act as 
conduits or resistors in the model, and the inter- 
action of the item and the system that controls 
the desired environment 
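
A very reduced version of such an analysis treats the intermediate systems as thermal resistors in series and evaluates the item temperature at the extreme and nominal external conditions. The sketch below assumes steady state, a single heat path, and invented numbers; the function names and the series-resistance model are illustrative assumptions, and a real analysis would also cover transients and the behavior of the controlling system.

```python
def item_temperature_c(external_c, dissipation_w, resistances_c_per_w):
    """Steady-state item temperature for a series thermal-resistance path."""
    return external_c + dissipation_w * sum(resistances_c_per_w)


def environment_ok(external_cases_c, dissipation_w, resistances_c_per_w, limits_c):
    """Check every external case against the item's allowed (min, max) limits."""
    low, high = limits_c
    return {
        name: low <= item_temperature_c(t, dissipation_w, resistances_c_per_w) <= high
        for name, t in external_cases_c.items()
    }


# All numbers below are invented for illustration.
cases = {"cold extreme": -30.0, "nominal": 20.0, "hot extreme": 50.0}
result = environment_ok(
    cases,
    dissipation_w=10.0,
    resistances_c_per_w=[0.5, 1.5],   # two intermediate systems in series
    limits_c=(-20.0, 60.0),
)
print(result)
```

Here the item runs 20 degrees C above the external temperature, so the hot extreme (70 degrees C at the item) falls outside the assumed limits while the other two cases pass.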
Appendix F 

Bracket System for Interfaces 


Brackets are used on hardware/engineering drawings to flag 
or identify details controlled by the ICD. Changes cannot be 
made to the drawings or designs without the effects on the 
interface being assessed and coordinated through the ICD 
process. 

The process uses a rating similar to that used in the problem/ 
failure reporting bracket system with the same controls and 
traceability. Once a bracket has been assigned to an interface 
void or problem, specific analyses and actions are required for 
the bracketed item to be removed. The bracketed item remains 
in open status with assignment to the responsible cognizant 
subsystem or design section until (1) the corrective action or 
coordinated information has been developed, (2) a proper risk 
assessment has been performed, (3) ICD change actions have 
been completed, (4) adequate verification of the interface is 
planned, and (5) the proper approval signatures have been 
obtained. 
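
The five closure conditions lend themselves to a simple tracking record. The following is a minimal sketch, assuming a hypothetical BracketedItem class whose field names paraphrase conditions (1) through (5); it is not a description of any actual tracking tool.

```python
from dataclasses import dataclass


@dataclass
class BracketedItem:
    """Hypothetical tracking record for one bracketed interface item."""
    corrective_action_developed: bool = False   # condition (1)
    risk_assessment_performed: bool = False     # condition (2)
    icd_change_completed: bool = False          # condition (3)
    verification_planned: bool = False          # condition (4)
    approvals_obtained: bool = False            # condition (5)

    def open_conditions(self):
        """List the closure conditions that are still unmet."""
        return [name for name, done in vars(self).items() if not done]

    def can_be_removed(self):
        """A bracket may be removed only when no condition remains open."""
        return not self.open_conditions()


item = BracketedItem(corrective_action_developed=True,
                     risk_assessment_performed=True)
print(item.open_conditions())
print(item.can_be_removed())
```

The item in this example stays in open status because the ICD change, the verification plan, and the approval signatures are still outstanding.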

The following ratings are used to establish a category of 
“bracket” identifiers for interface deficiencies. Any discrepancy 
having an A rating greater than 1 or a B rating greater than 
2 will be designated a bracketed discrepancy (see figure F.1). 

I. Interface deficiency rating A (S&MA impact) 

A. Rating A1: Negligible effect on interface or mission 
performance 

1. No appreciable change in functional capability (form, 
fit, and function are adequate for the mission) 

2. Minor degradation of engineering or science data 

3. Support equipment or test equipment failure but not 
mission-critical element failure 

4. Support-equipment- or test-equipment-induced 
failures 


5. Drawing errors not affecting element construction 

B. Rating A2: Significant degradation to interface or 
mission performance 

1. Appreciable change in functional capability 

2. Appreciable degradation of engineering or science 
data 

3. Significant operational difficulties or constraints 

4. Decrease in life of interfacing equipment 

5. Significant effect on interface or system safety 

C. Rating A3: Major degradation to interface or mission 
performance or catastrophic effect on interface or 
system safety 

II. Interface deficiency rating B (understanding of risk) 

A. Rating B1: Effect of interface deficiency is identified 
by analysis or test, and resolution or corrective 
action is assigned and scheduled or implemented 
and verified. There is no possibility of recurrence. 

B. Rating B2: Effect of interface deficiency is not fully 
determined. However, the corrective action proposed, 
scheduled, or implemented is considered effective in 
correcting the deficiency. There is minimal possibility 
of recurrence and little or no residual risk. 

C. Rating B3: Effect of interface deficiency is well 
understood. However, the corrective changes pro- 
posed do not completely satisfy all doubts or concerns 
regarding the correction, and the effectiveness of 
corrective action is questionable. There is some poss- 
ibility of recurrence with residual risk. 

D. Rating B4: Effect of interface deficiency is not well 
understood. Corrections have not been proposed or 
those proposed have uncertain effectiveness. There is 
some possibility of recurrence with residual risk. 
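
The designation rule stated before the rating definitions (an A rating greater than 1 or a B rating greater than 2) can be expressed directly. The function name is_bracketed and the integer encoding of the ratings are assumptions made for this sketch.

```python
def is_bracketed(rating_a: int, rating_b: int) -> bool:
    """Apply the stated rule: A rating greater than 1 or B rating greater than 2."""
    if rating_a not in (1, 2, 3):
        raise ValueError("rating A must be 1, 2, or 3")
    if rating_b not in (1, 2, 3, 4):
        raise ValueError("rating B must be 1, 2, 3, or 4")
    return rating_a > 1 or rating_b > 2


for a, b in [(1, 1), (1, 2), (1, 3), (2, 1)]:
    print(f"A{a}/B{b}: bracketed={is_bracketed(a, b)}")
```

Only the A1/B1 and A1/B2 combinations escape bracketing; any significant S&MA impact, or any poorly resolved risk, flags the discrepancy.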
Rating A                 Numerical   Rating B 
(S&MA impact)            rating      (understanding of risk) 

Negligible impact            1       Known deficiency with corrective action 
                                     assigned, scheduled, and implemented 

Significant degradation      2       Deficiency poorly defined but acceptable 
                                     corrective action proposed, scheduled, and 
                                     implemented (low residual risk) 

Major degradation            3       Known deficiency but effectiveness of 
                                     corrective action is unclear and does not 
                                     satisfy all doubts and concerns (residual risk) 

                             4       Impact not defined with confidence; 
                                     corrective action with uncertain 
                                     effectiveness (residual risk) 

Legend: Interface discrepancy red flag; project or task manager approval required 

Figure F.1.—Interface deficiency rating system. 
Appendix G 

ICD Guidelines 

1. Interface control documents should not require the designer of the 
mating interface to assume anything. ICD’s should be compatible with 
each other and stand alone. 

2. Only the definition that affects the design of the mating interfaces 
need be used. 

3. ICD’s should not specify design solutions. 

4. The ICD custodian should be independent of the design organiza- 
tion. 

5. The ICD custodian should verify that the data being controlled by 
an ICD are sufficient to allow other organizations to develop the 
interface described by the ICD. 

6. An interface control system should be in place at the beginning of 
system (hardware or software) development. 

7. Each void should have a unique sequential identifier establishing 
due dates, identifying exact data to be supplied, and identifying the data 
supplier. 
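
The guideline on void identifiers can be sketched as a small log that issues sequential identifiers. Everything named here (the Void and VoidLog classes, the "ICD-0001-V001" numbering pattern, and the sample entries and dates) is a hypothetical illustration, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count


@dataclass(frozen=True)
class Void:
    """Hypothetical record for one missing piece of interface data."""
    identifier: str      # unique and sequential within the ICD
    data_required: str   # exact data to be supplied
    supplier: str        # who must supply the data
    due: date            # date by which the data are due


class VoidLog:
    """Issues unique sequential identifiers for the voids of one ICD."""

    def __init__(self, icd_number: str):
        self._icd_number = icd_number
        self._sequence = count(1)
        self.voids = []

    def open_void(self, data_required: str, supplier: str, due: date) -> Void:
        identifier = f"{self._icd_number}-V{next(self._sequence):03d}"
        void = Void(identifier, data_required, supplier, due)
        self.voids.append(void)
        return void


# Sample entries are invented for illustration.
log = VoidLog("ICD-0001")
first = log.open_void("connector pin assignments", "payload contractor", date(2030, 1, 31))
second = log.open_void("mounting hole pattern", "carrier contractor", date(2030, 3, 15))
print(first.identifier, second.identifier)
```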
Appendix H 

Glossary 


baseline — The act by which the program manager or a designated 
authority signs an interface control document (ICD) and 
by that signature establishes the genuineness of the ICD as an 
official document defining the interface design requirements. 
The term “baseline” conveys the idea that the ICD is the only 
official definition and that this officiality comes from the 
technical management level. Not only is the initial version of 
the ICD baselined, but each change to an ICD is likewise 
approved. 

comment issue — An issue of an ICD distributed for review and 
comment before a meeting of the affected parties and before 
baselining 

custodian — The contractor or project assigned the responsibility 
of preparing and processing an ICD through authentication 
and subsequently through the change process 

data — Points, lines, planes, cylinders, and other geometric 
shapes assumed to be exact for the purpose of computation and 
from which the location or geometric relationship (form) of 
features of a piece of equipment can be established 

interface responsibility matrix — A matrix of contractors, 
centers, and project organizations that specifies responsibilities 
for each ICD listed for a particular task. Responsibilities are 
designated as review and comment, technical approval, 
baselining, and information. 

electrical/functional interface — An interface that defines the 
interdependence of two or more pieces of equipment when the 
interdependence arises from the transmission of an electrical 
signal from one piece of equipment to another. All electrical 
and functional characteristics, parameters, and tolerances of 
one equipment design that affect another equipment design are 
specified. 

interface — That design feature of one piece of equipment that 
affects a design feature of another piece of equipment. An 
interface can extend beyond the physical boundary between 
two items. (For example, the weight and center of gravity of 
one item can affect the interfacing item; however, the center of 
gravity is rarely located at the physical boundary. An electrical 
interface generally extends to the first isolating element rather 
than terminating at a series of connector pins.) 

interface control — The process of (1) defining interface 
requirements to ensure compatibility between interrelated pieces 
of equipment and (2) providing an authoritative means of 
controlling the interface design. 

interface control document (ICD) — A drawing or other docu- 
mentation that depicts physical and functional interfaces of 
related or cofunctioning items. (The drawing format is the most 
common means of controlling the interface.) 

interface control working group — A group convened to 
control and expedite interface activity between the Govern- 
ment, contractors, and other organizations, including resolu- 
tion of interface problems and documentation of interface 
agreements 

interface definition — The specification of the features, char- 
acteristics, and properties of a particular area of an equipment 
design that affect the design of another piece of equipment 

interoperability — The ability of two devices to exchange 
information effectively across an interface 

mechanical/physical interface — An interface that defines the 
mechanical features, characteristics, dimensions, and toler- 
ances of one equipment design that affect the design of another 
subsystem. Where a static or dynamic force exists, force 
transmission requirements and the features of the equipment 
that influence or control this force transmission are also de- 
fined. Mechanical interfaces include those material properties 
of the equipment that can affect the functioning of mating 
equipment or the system (e.g., thermal and galvanic 
characteristics). 

software interface — The functional interface between the 
computer program and any equipment hardware with which it 
must operate. Testing required to validate the interface designs 
is also specified. 

supplied-services interface — Those support requirements that 
equipment needs to function and that are provided by an 
external separate source. This category of interface can be 
further subdivided into environmental, electrical power, and 
communication requirements. 

technical approval — The act of certifying that the technical 
content in an interface document or change issue is acceptable 
and that the signing organization is committed to implementing 
the portion of the interface design under the signer’s cognizance. 
Training Answers 

Chapter   Answers 

1         1(A); 2(D); 3(C); 4a(C), 4b(A), 4c(B) 

2         1(D); 2(C); 3a(B), 3b(C); 4a(C), 4b(C), 4c(C); 
          5a(A), 5b(A), 5c(A); 6a(C), 6b(A), 6c(A); 
          7a(B), 7b(B), 7cA(i), 7cB(ii), 7cC(i); 8a(B), 
          8b(A); 9a(A), 9b(A), 9c(B) 

3         1a(A), 1b(B), 1c(B); 2a(B), 2b(A), 2c(B); 
          3a(A), 3b(B), 3c(A); 4a(B), 4b(A), 4c(B); 
          5a(A), 5b(B); 6a(A), 6b(B), 6c(B), 6d(B); 
          7a(B), 7b(A), 7c(A), 7d(A), 7e(A) 